
HPE DDR4 server memory performance in HPE ProLiant and HPE Synergy Gen10 servers with Intel Xeon Scalable processors

Technical white paper


Contents

Introduction
Optimizing memory configurations
  Prohibited configurations
  Understanding unbalanced memory configurations
  Optimizing for capacity
  Optimizing for performance
  Optimizing for lowest power consumption
ROM-Based Setup Utility settings
  Optimizing for resiliency
  Optimizing for power
  Settings for memory operation
Impact of CPU SKUs on memory performance
  Core count
  Maximum DDR4 data rate
Conclusion
Appendix A
Appendix B


Introduction

The HPE ProLiant and HPE Synergy Gen10 servers introduced significant memory performance advantages over their Gen9 counterparts. The Intel® Xeon® Scalable (Skylake) processors used in HPE Gen10 servers increased the number of memory channels from 8 to 12 in two-socket servers and from 16 to 24 in four-socket servers. In both cases, the maximum DIMM data rate increased from 2400 MT/s to 2666 MT/s. Together, the additional memory channels and the higher data rate contributed to a 66% growth in DDR4 memory bandwidth from Gen9 to Gen10 servers. With the addition of 2933 MT/s HPE SmartMemory, introduced with Intel Cascade Lake-based processors, this grows to an 81% improvement over Gen9 memory. At the same time, many Gen10 servers continue to support two DIMMs per channel (2DPC) at the maximum allowable data rate of 2933 MT/s.
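
The bandwidth arithmetic behind those percentages can be sketched with a quick calculation. The snippet below computes theoretical peak bandwidth (channels x data rate x 8 bytes per transfer); the resulting ratios are approximations that land near the measured 66% and 81% figures quoted above.

# Back-of-the-envelope peak DDR4 bandwidth: channels * MT/s * 8 bytes per transfer.
# Theoretical peaks only; the 66% and 81% figures above come from HPE measurements.
def peak_bw_gbs(channels: int, mt_per_s: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return channels * mt_per_s * bytes_per_transfer / 1000

gen9 = peak_bw_gbs(channels=8, mt_per_s=2400)       # Gen9 two-socket: 8 channels at 2400 MT/s
gen10 = peak_bw_gbs(channels=12, mt_per_s=2666)     # Gen10 two-socket: 12 channels at 2666 MT/s
gen10_cl = peak_bw_gbs(channels=12, mt_per_s=2933)  # Gen10 with Cascade Lake: 12 channels at 2933 MT/s

print(f"Gen9 -> Gen10 at 2666 MT/s: +{gen10 / gen9 - 1:.0%}")     # roughly +67%
print(f"Gen9 -> Gen10 at 2933 MT/s: +{gen10_cl / gen9 - 1:.0%}")  # roughly +83%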

The Gen10 ROM-Based Setup Utility (RBSU) made it easier to deploy these servers by introducing Workload Profiles, which let users optimize the CPU, memory, and I/O configuration with a single menu selection based on specific workloads such as high-performance compute (HPC), database, and virtualization. While Workload Profiles significantly simplify configuring HPE ProLiant servers, RBSU continues to let advanced users individually configure specific features in the CPU, memory, and I/O subsystems.

Hewlett Packard Enterprise continues to increase the resiliency of the HPE ProLiant and HPE Synergy Gen10 servers that use the Intel® Xeon® Scalable family of processors by introducing HPE Fast Fault Tolerance. This memory reliability, availability, and serviceability (RAS) feature, the result of a collaboration between Intel and HPE, allows the server to continue running with a very small impact to memory throughput when up to two DRAMs fail on a DIMM. In previous generations, this level of protection required the system to boot with a 50% reduction in memory throughput in anticipation of multiple DRAM failures. In Gen10 servers, there is no performance loss until one DRAM fails, and even then, the reduction in maximum throughput is small.

Optimizing memory configurations

By taking advantage of the different DIMM types and capacities available for HPE ProLiant and Synergy Gen10 servers, you can optimize your server memory configuration to meet different application or data center requirements.

Prohibited configurations

As important as memory performance may be in a server application, there is an even more critical consideration: the memory configuration must be valid for the server to boot. Across all of the possible memory configurations in HPE ProLiant Gen10 servers, some combinations are not supported and must be avoided.

No UDIMMs

HPE ProLiant Gen10 servers based on the Intel Xeon Scalable processor family do not support unregistered DIMMs (UDIMMs). If even one UDIMM is installed in one of these servers, the server will not boot. Only registered DIMMs (RDIMMs) and load-reduced DIMMs (LRDIMMs) are supported on these servers.

No mixing of RDIMMs and LRDIMMs

Even though they are supported on HPE ProLiant and HPE Synergy Gen10 servers, RDIMMs and LRDIMMs may not be mixed within a server. If these DIMMs are mixed in any way in a server, the server will not boot.

No mixing of quad rank and octal rank LRDIMMs

Even though they are both technically LRDIMMs, the underlying architectures of quad rank LRDIMMs and octal rank LRDIMMs are functionally incompatible with each other. Due to this, these DIMMs may not be mixed within an HPE ProLiant or HPE Synergy Gen10 server. If these DIMMs are mixed in a server, the server will not boot.

2666 MT/s server memory

All systems purchased with the updated Intel Xeon Scalable processors ship with 2933 MT/s memory; however, in the case of a CPU upgrade to an existing system, 2666 MT/s memory is fully supported. In these cases, the memory operating speed is limited to 2666 MT/s, which should be considered when evaluating memory performance. If 2933 MT/s operation is desired along with a CPU upgrade, the 2666 MT/s memory must be replaced with 2933 MT/s memory as well.

RAS feature requirements

In Gen10 servers, several memory RAS features may be enabled to improve the overall reliability of the server. Each one has certain population requirements in order to function properly. If these RAS features are desired, the first step is to meet the population requirements of the feature; performance optimization may have to be sacrificed in order to enable RAS functionality.
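
The boot-blocking rules above lend themselves to a simple pre-install check. The sketch below is an illustrative validator and not an HPE tool; the data structure and rule set are assumptions that encode only the three prohibitions described in this section.

# Illustrative pre-install sanity check for the rules in this section (not an HPE tool).
from dataclasses import dataclass

@dataclass
class Dimm:
    dimm_type: str  # "RDIMM", "LRDIMM", or "UDIMM"
    ranks: int      # 1, 2, 4, or 8

def boot_blockers(dimms: list[Dimm]) -> list[str]:
    """Return the reasons a proposed population would prevent the server from booting."""
    errors = []
    types = {d.dimm_type for d in dimms}
    if "UDIMM" in types:
        errors.append("UDIMMs are not supported; use RDIMMs or LRDIMMs only")
    if {"RDIMM", "LRDIMM"} <= types:
        errors.append("RDIMMs and LRDIMMs may not be mixed in the same server")
    lrdimm_ranks = {d.ranks for d in dimms if d.dimm_type == "LRDIMM"}
    if {4, 8} <= lrdimm_ranks:
        errors.append("quad rank and octal rank LRDIMMs may not be mixed")
    return errors

# Example: an RDIMM/LRDIMM mix is flagged as a non-booting configuration.
print(boot_blockers([Dimm("RDIMM", 2), Dimm("LRDIMM", 4)]))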


Understanding unbalanced memory configurations

Unbalanced memory configurations are those in which the installed memory is not distributed evenly across the memory channels and/or processors. HPE discourages unbalanced configurations because they will always have lower performance than similar balanced configurations. There are two types of unbalanced configurations, each with its own performance implications.

• Unbalanced across channels: A memory configuration is unbalanced across channels if the memory capacities installed on each of the six channels of each installed processor are not identical.

• Unbalanced across processors: A memory configuration is unbalanced across processors if a different amount of memory is installed on each of the processors.

Memory configurations that are unbalanced across channels

In memory configurations that are unbalanced across channels, the memory controller splits memory into regions, as shown in Figure 1. Each region of memory has different performance characteristics. The memory controller groups memory across channels, where possible, to create the regions. It first attempts to create regions that span all six memory channels, since these have the highest performance; it then creates regions that span four channels, then three, then two, and finally one.

Figure 1. A memory configuration that is unbalanced across memory channels. The diagram shows a single CPU with two memory controllers; the identically populated channels form an interleaved Region 1, while the additional memory on one channel forms a non-interleaved Region 2.

The primary effect of memory configurations that are not balanced across channels is a decrease in memory throughput in the regions that span fewer memory channels. In the example shown in Figure 1, measured memory throughput in Region 2 may be as little as 16.6% of the throughput in Region 1.

Fundamentally, a balanced configuration optimizes interleaving. An unbalanced configuration limits the ability of the memory controller to interleave memory accesses and optimize performance.
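
A proposed layout can be screened for channel balance before any DIMMs are installed. The function below is a simplified illustration of the rule described above (identical capacity on every channel of a processor); it is not the memory controller's actual region-building algorithm.

# Simplified balance check: a processor is balanced across its channels when every
# channel carries the same installed capacity (illustrative only).
def channel_balance(capacity_per_channel_gb: list[int]) -> str:
    if len(set(capacity_per_channel_gb)) == 1 and capacity_per_channel_gb[0] > 0:
        return "balanced: one interleaved region spans all channels"
    return "unbalanced: the controller creates multiple regions with differing throughput"

print(channel_balance([32, 32, 32, 32, 32, 32]))  # six identical channels -> balanced
print(channel_balance([64, 32, 32, 32, 32, 32]))  # extra DIMM on one channel -> unbalanced (Figure 1)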

Memory configurations that are unbalanced across processors

Figure 2 shows a memory configuration that is unbalanced across processors. The CPU1 threads operating on the larger memory capacity of CPU1 may have adequate local memory with relatively low latencies. The CPU2 threads operating on the smaller memory capacity of CPU2 may consume all available memory on CPU2 and request remote memory from CPU1. The longer latencies associated with the remote memory will result in reduced performance of those threads. In practice, this may result in nonuniform performance characteristics for program threads, depending on which processor executes them. In addition to the higher latency when accessing memory on the other CPU, if there are enough threads accessing memory on a different CPU, then the bandwidth of the cross-CPU link itself becomes a limiting factor.


Figure 2. A memory configuration that is not balanced across processors. The diagram shows two CPUs connected by UPI links, each with two memory controllers and six memory channels; one CPU has more memory installed than the other.

Optimizing for capacity

You can maximize memory capacity on HPE ProLiant and HPE Synergy Gen10 servers using 128 GB LRDIMMs. With LRDIMMs, you can install up to two octal-ranked DIMMs in a memory channel. On 24-slot servers, you can configure the system with up to 3072 GB of total memory.

Table 1. Maximum memory capacities for HPE ProLiant Gen10 two-socket servers using different DDR4 DIMM types

Number of DIMM slots   DIMM type   Maximum capacity   Configuration
24                     RDIMM       768 GB             24 x 32 GB 2R
24                     LRDIMM      3072 GB            24 x 128 GB 8R
16                     RDIMM       512 GB             16 x 32 GB 2R
16                     LRDIMM      2048 GB            16 x 128 GB 8R

Mixing different capacity DIMMs in the same channel

There are no performance implications for mixing sets of different capacity DIMMs at the same operating speed. For example, latency and throughput will not be negatively impacted by installing twelve 16 GB dual rank DDR4-2933 DIMMs (one per channel) together with twelve 32 GB dual rank DDR4-2933 DIMMs (one per channel).

For optimal throughput and latency, populate all six channels of each installed CPU identically.

Optimizing for performance

The two primary measurements of memory subsystem performance are throughput and latency. Latency is a measure of the time it takes for the memory subsystem to deliver data to the processor core after the processor makes a request. Throughput measures the total amount of data that the memory subsystem can transfer to the system processors during a given period—usually one second.

Impact of channel population

Every channel that is left unpopulated results in lost bandwidth. For the Intel Xeon Scalable processor family, with six channels per CPU, each unpopulated channel forfeits one-sixth of the bandwidth the memory system is capable of. To maximize memory bandwidth, at least one DIMM should be populated in every memory channel of every installed CPU.
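
The one-sixth rule can be made concrete with theoretical peak numbers. The values below are assumptions (theoretical per-channel bandwidth at 2933 MT/s), not measurements from this paper.

# Each unpopulated channel forfeits one-sixth of a CPU's theoretical peak bandwidth.
CHANNELS_PER_CPU = 6
PEAK_PER_CHANNEL_GBS = 2933 * 8 / 1000  # one DDR4-2933 channel, about 23.5 GB/s theoretical

for populated in range(CHANNELS_PER_CPU, 0, -1):
    peak = populated * PEAK_PER_CHANNEL_GBS
    share = populated / CHANNELS_PER_CPU
    print(f"{populated} of 6 channels populated: about {peak:5.1f} GB/s per CPU ({share:.0%} of maximum)")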


Impact of interleaving

Memory interleaving is a technique used to maximize memory performance by spreading memory accesses evenly across memory devices. Interleaved memory results in a contiguous memory region across multiple devices, with sequential accesses using each memory device in turn instead of using the same one repeatedly. The result is more efficient use of the data bus and higher memory throughput, because less time is spent waiting for the same memory device to be ready for the next access.

There are several kinds of interleaving in HPE ProLiant and HPE Synergy Gen10 servers, including rank interleaving, channel interleaving, memory controller interleaving, and node interleaving. Note that in any interleaving configuration, the memory controllers can only interleave across identically configured memory channels and ranks. See the Understanding unbalanced memory configurations section for more information.
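
Channel interleaving can be pictured as a round-robin assignment of consecutive cache lines to channels. The model below is a conceptual illustration only; the actual Intel address-decoding scheme also folds in memory controller, rank, and bank bits.

# Conceptual model of channel interleaving: consecutive 64-byte cache lines are
# assigned to memory channels round-robin (not the actual Intel address decoder).
CACHE_LINE_BYTES = 64

def channel_for_address(addr: int, channels: int = 6) -> int:
    return (addr // CACHE_LINE_BYTES) % channels

for line in range(8):
    addr = line * CACHE_LINE_BYTES
    print(f"address 0x{addr:05x} -> channel {channel_for_address(addr)}")
# Sequential accesses walk channels 0, 1, 2, 3, 4, 5, 0, 1, ... so every channel
# stays busy instead of the same DRAM device being used repeatedly.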

Impact of single rank and dual rank DIMMs

Essentially, the choice between single rank and dual rank DIMMs comes down to rank interleaving. When one single rank DIMM is installed on a channel, there is no opportunity to interleave between ranks on that channel. Installing a single dual rank DIMM on the channel, however, allows rank interleaving to reduce inefficiencies on the memory channel. Going from no interleaving on a channel to a two-way interleave (as happens with a dual rank DIMM) results in higher throughput and lower latency. For highly sequential memory accesses, a throughput improvement of as much as 10% has been observed.

Impact of number of ranks on the channel

Just like with dual rank DIMMs, multiple DIMMs on the channel will also enable rank interleaving. If two single rank DIMMs of the same capacity and configuration are installed on a channel, the overall performance will be almost identical to that achieved by installing a single dual rank DIMM on the channel. If two dual rank DIMMs are installed on the channel, additional performance gains may be achieved as long as the DIMMs are identical. In this case, the channel can interleave sequential accesses across all four ranks of memory, resulting in an additional throughput increase beyond what was achieved with only two ranks on the channel.

Impact of RDIMMs and LRDIMMs

In addition to the number of ranks on the channel, the construction of the DIMM also has an impact on performance. Of the two main DIMM types, RDIMMs should be selected for highest throughput and lowest latency. LRDIMMs, on the other hand, should be selected when optimizing for high capacity. This is due to the addition of a data buffer on the LRDIMMs that doesn’t exist on the RDIMMs. The addition of the data buffer enables the creation of a higher capacity DIMM but at the cost of additional latency as the data travels through the buffer. In a situation where the memory bus is heavily utilized, this added latency will also result in lower throughput as a side effect.

Optimizing for lowest power consumption

Several factors determine the power that a DIMM consumes in a system, including:

• DIMM technology

• DIMM capacity

• Number of DIMM ranks

• DIMM operating speed

Power consumption can also be reduced by adjusting the number of DIMMs installed. For example, a single 32 GB DIMM will consume less power than two 16 GB DIMMs or four 8 GB DIMMs. If there are multiple ways to achieve the desired memory capacity, choosing the smaller number of DIMMs may provide a lower-power solution, but at the expense of throughput if channels are left unpopulated.

ROM-Based Setup Utility settings

Selecting Workload Profiles

The RBSU Workload Profiles are one of the HPE Intelligent System Tuning features. Each profile optimizes the resources in the HPE ProLiant and HPE Synergy Gen10 servers to match the selected workload. Upon selection, the server automatically configures the BIOS settings in each of the affected subsystems. Settings that are controlled by the Workload Profile selection are grayed out and cannot be changed in the corresponding subsystem menus. The Custom profile allows the user to independently configure all BIOS options. Here is the list of subsystems that are optimized by the various Workload Profiles:

• Processor

• Memory

• Virtualization

• Power and performance


See the UEFI Workload-based Performance and Tuning Guide for HPE ProLiant Gen10 Servers and HPE Synergy document for details on how each profile affects the various subsystems.

Figure 3. Selecting the Workload Profile in RBSU

Here is the list of Workload Profiles available in the HPE ProLiant and HPE Synergy Gen10 servers:

1. General Power Efficient Compute (GPEC) (default)

2. General Peak Frequency Compute

3. General Throughput Compute

4. Virtualization—Power Efficient

5. Virtualization—Max Performance

6. Low Latency (LL)

7. Mission Critical

8. Transactional Application Processing

9. HPC

10. Decision Support

11. Graphics Processing

12. I/O Throughput

13. Custom
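
Beyond the F9 menus, a Workload Profile can also be applied programmatically through the iLO Redfish interface. The sketch below follows the generic Redfish pattern of PATCHing pending BIOS settings; the iLO address, credentials, attribute name, and value string are illustrative assumptions and should be verified against the BIOS attribute registry reported by your server before use.

# Illustrative sketch: selecting a Workload Profile through the iLO Redfish API.
# The attribute name and value below are assumptions; confirm them against your
# server's BIOS attribute registry. The change is applied at the next reboot.
import requests

ILO_URL = "https://ilo-hostname"  # hypothetical iLO address
AUTH = ("admin", "password")      # hypothetical credentials

response = requests.patch(
    f"{ILO_URL}/redfish/v1/Systems/1/Bios/Settings/",
    json={"Attributes": {"WorkloadProfile": "GeneralPowerEfficientCompute"}},
    auth=AUTH,
    verify=False,  # lab use only: skips TLS certificate verification
)
response.raise_for_status()
print("Pending BIOS setting accepted; the profile takes effect after a reboot.")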

Figure 4 shows the impact of Workload Profiles and CPU stock keeping units (SKUs) on memory throughput. The 1:0 read:write (R:W) workload (turquoise bars) is read only, while the 2:1 R:W workload (purple bars) issues twice as many reads as writes. The Gold 5122, with only four cores, does not have enough cores to generate significant memory traffic. The Gold 5118, with 12 cores, is limited by its DDR4-2400 capability.


Figure 4. Impact of Workload Profiles and CPU SKUs on memory throughput (HPE DL360 Gen10 server, 32 GB 2Rx4, 2DPC at DDR4-2666; 1:0 R:W and 2:1 R:W workloads). Performance differences between Workload Profiles at 2933 MT/s may improve by up to 10%.

Setting memory node interleaving

Memory node interleaving intersperses the memory address map across the installed processors. It is disabled by default. Node interleaving may be enabled in the RBSU Memory Options menu.

When node interleaving is disabled, the BIOS maps the system memory such that the memory addresses for the DIMMs attached to a given processor are contiguous. For conventional applications, this arrangement is more efficient because each processor directly accesses the memory addresses containing the code and data for the programs it is executing. Additionally, modern operating systems are very effective at allocating memory on the CPU node where the process will run.

When node interleaving is enabled, system memory addresses are alternated, or interleaved, across the DIMMs installed on both processors. In this case, each successive page in the system memory map is physically located on a DIMM attached to a different processor. For a subset of specific workloads—in particular those using shared data sets—the system may operate at a higher level of performance with node interleaving enabled.
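
On a running Linux host, the resulting NUMA layout can be inspected to confirm that, with node interleaving disabled, each processor presents its own memory node and the nodes are roughly equal in size (a configuration balanced across processors). A minimal sketch, assuming the usual Linux sysfs layout:

# Minimal sketch (Linux only): report the memory the OS sees on each NUMA node.
# With node interleaving disabled, each node corresponds to one processor, so
# roughly equal MemTotal values indicate memory balanced across the CPUs.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    for line in (node / "meminfo").read_text().splitlines():
        if "MemTotal" in line:
            print(node.name, line.split(":")[1].strip())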

Optimizing for resiliency

DDR4 DIMMs may be constructed using either 4-bit wide (x4) or 8-bit wide (x8) DRAM chips. Current error-correcting code (ECC) algorithms used in the memory controllers are generally capable of detecting and correcting memory errors up to four bits wide. For DIMMs constructed using x4 DRAMs, an entire DRAM chip on the memory module can fail and the memory controller will detect and correct the errors. These same systems cannot tolerate the failure of a DRAM chip on DIMMs constructed using x8 DRAMs; the ECC algorithm can detect the failure, but it cannot correct it. As a result, systems configured with DIMMs using x4 DRAMs are more resilient to memory failures than those using x8 DRAMs. DDR4 RDIMMs may be constructed with x4 or x8 DRAMs; all DDR4 LRDIMMs use x4 DRAMs only.

2x Refresh

Dynamic random access memory (DRAM) uses arrays of capacitors to store a charge to represent a value. For example, a charged capacitor may represent a 1 and a discharged capacitor may represent a 0. A charged capacitor slowly discharges over time, and a fully discharged capacitor may slowly accumulate some charge over time—but not enough to become fully charged. To prevent a state of 1 switching to 0 or vice versa, the memory controller periodically commands the DRAM to sense the charge in the capacitors and then to refresh the charge. That is, if the capacitor was close to being charged, recharge it; and if the capacitor was close to being discharged, then remove any charge. This act of dynamically refreshing the charge is an inherent feature of DRAM, and so it is included in the name.



The industry-standard specification for DRAM requires the memory controller to refresh all capacitors on a DIMM every 64 ms. This is called a 1x Refresh rate. A DDR4 32 GB dual rank x4 RDIMM contains 288 billion capacitors. Occasionally, due to aging or access patterns, a capacitor may self-discharge quickly or may only be partially charged. To increase the DRAM resiliency and guard against a capacitor inadvertently switching states, select the 2x Refresh option in the RBSU menu as shown in Figure 5. With the 2x Refresh rate option, the memory controller refreshes each capacitor every 32 ms.

Figure 5. Selecting the Memory Refresh Rate

With ECC, the standard 1x Refresh rate, and single device data correction (SDDC) in the memory controller, HPE Gen10 servers are already very resilient to failures of individual capacitors in the DRAM. A 2x Refresh rate slightly improves server resiliency and uptime, but with a roughly 4% reduction in memory throughput.
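
The cost of the extra refreshes can be estimated from the DDR4 timing parameters: at the standard 1x rate, a refresh command is issued on average every tREFI = 7.8 us, and the DRAM is unavailable for tRFC per refresh. The tRFC value below is an assumed figure typical of 8 Gb DDR4 devices, so treat the percentages as rough estimates consistent with the roughly 4% throughput cost noted above.

# Rough estimate of DRAM time consumed by refresh at the 1x and 2x refresh rates.
# tRFC is assumed for a typical 8 Gb DDR4 device; actual values vary by part.
T_REFI_1X_NS = 7812.5  # average refresh interval at 1x (64 ms spread over 8192 refresh commands)
T_RFC_NS = 350.0       # assumed refresh cycle time for an 8 Gb DDR4 DRAM

for label, t_refi in (("1x Refresh", T_REFI_1X_NS), ("2x Refresh", T_REFI_1X_NS / 2)):
    print(f"{label}: about {T_RFC_NS / t_refi:.1%} of time spent refreshing")
# Going from 1x to 2x roughly doubles the overhead (about 4.5% to 9%), in line
# with the roughly 4% drop in maximum throughput noted above.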

HPE Fast Fault Tolerance

HPE Fast Fault Tolerance is a new feature in HPE ProLiant Gen10 servers based on the Intel Xeon Scalable processor family. It combines HPE error detection technology with a new feature called adaptive double-device data correction (ADDDC), which can correct up to two DRAM failures on a DIMM while the server keeps running. This is the same level of protection that was previously available only in high-end servers but without the permanent performance degradation. Unlike previous generations, where the protection was permanently enabled along with a 50% reduction in maximum throughput, ADDDC is adaptive and only enables the higher level of protection once errors are found on the first DRAM. Additionally, the protection is only enabled in the region where the error occurred, which preserves peak performance in the remaining memory regions that are not impacted by the memory error.

Online spare

Online sparing provides protection against persistent DRAM failure. The system tracks correctable errors in all ranks. If a rank experiences excessive correctable errors, the system copies the contents of that rank to an available spare rank. This improves system uptime by preventing correctable errors from becoming uncorrectable. It does not identify or disable individual failed DRAMs; instead, it disables the DIMM rank. Since a DIMM rank is needed to perform sparing, this technique reduces the total amount of available memory by the amount of memory used for sparing. Sparing can only handle one failure per channel. Ranks within a DIMM that are likely to receive a fatal/uncorrectable memory error are automatically removed from operation, resulting in less system downtime.

Optimizing for power

Controlling memory speed

The optimal memory operating speed in HPE Gen10 servers is determined by the capability of the CPU and the DIMMs. The DIMMs are rated at 2933 MT/s or, for earlier kits, 2666 MT/s. While the vast majority of Intel CPU SKUs used in HPE Gen10 servers are rated at 2933 MT/s, a few SKUs only support up to 2666 MT/s, 2400 MT/s, or even 2133 MT/s. Each CPU SKU is capable of operating the memory down to 1866 MT/s. Lower data rates result in reduced power consumption and a reduction in the maximum available memory bandwidth. A lower data rate may be selected from the RBSU Memory Options menu. The memory bus speed may be set to any of the following in the HPE ProLiant and HPE Synergy Gen10 servers:

• Automatic (speed determined according to normal population rules)

• 2933 MT/s

• 2666 MT/s

• 2400 MT/s

• 2133 MT/s

• 1866 MT/s


Disabling memory channel interleaving

Channel interleaving is enabled by default and can be disabled in the RBSU Memory Options menu when the Custom Workload Profile is selected. Disabling memory interleaving reduces DIMM power consumption but may also decrease overall memory system performance, depending on the type of workload.

Settings for memory operation

The HPE server BIOS provides user control over several memory configuration settings. You may access and change these settings using the RBSU Memory Options menu, which is available in all HPE ProLiant and HPE Synergy Gen10 servers. To launch RBSU, press the F9 key during the server boot sequence.

Impact of CPU SKUs on memory performance

In addition to the memory configuration, CPU selection can have a significant impact on performance. Whether through core count, supported bus speed, or some other consideration, the CPU plays a key role in determining the maximum memory throughput of the server. Figure 6 shows the impact of several CPU SKUs on memory throughput. Two bars are shown for each CPU SKU: the turquoise bar represents a read-only workload (a 1:0 R:W ratio), and the purple bar represents a 2:1 R:W ratio. The Gold 5122 and Gold 5118 exhibit higher memory throughput with the 2:1 R:W workload than with the read-only workload, while the Gold 6134 and Platinum 8176 achieve their highest throughput with the read-only workload. Note that while these measurements were executed with the 51xx, 61xx, and 81xx family processors, this data is illustrative of the relationships that will exist across the different CPUs supported in Gen10 servers.

Figure 6. The impact of CPU SKU on memory throughput (HPE DL360 Gen10 server, HPC Workload Profile, 32 GB 2Rx4 1DPC DDR4-2666; 1:0 R:W and 2:1 R:W workloads)

Core count

As CPU core counts have grown over the years, so have the number of memory channels and the operating data rate. As a result, a single CPU core is not able to generate enough memory traffic to saturate even a single memory channel. It takes about 24 CPU cores running a memory-intensive workload to saturate the six 2933 MT/s memory channels of one CPU in HPE Gen10 servers. For memory throughput-sensitive workloads, this means selecting a CPU SKU that has enough cores to achieve the desired memory throughput. Figure 7 shows the impact of active cores from various CPU SKUs on memory throughput. The number of active cores is shown on the x-axis and the throughput on the y-axis; each color on the chart represents a different CPU SKU. The inflection point at the center of each line is the point at which all the cores on one CPU socket are active and the first core on the second CPU is added. Activating all the cores on one socket before any cores on the second socket leaves resources unused and is included here only to demonstrate the point; ideally, cores should be activated uniformly across all installed sockets to take full advantage of the available memory and throughput.
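
The 24-core rule of thumb follows from dividing a socket's theoretical channel bandwidth by the traffic one core can typically sustain. The per-core figure below is an assumption used only for illustration, not a measurement from this paper.

# Rough sizing: how many cores does it take to saturate one socket's memory channels?
CHANNELS_PER_SOCKET = 6
PEAK_PER_CHANNEL_GBS = 2933 * 8 / 1000                        # about 23.5 GB/s theoretical per channel
SOCKET_PEAK_GBS = CHANNELS_PER_SOCKET * PEAK_PER_CHANNEL_GBS  # about 141 GB/s per socket

PER_CORE_DEMAND_GBS = 6.0  # assumed sustained demand of one core on a streaming workload

cores_needed = SOCKET_PEAK_GBS / PER_CORE_DEMAND_GBS
print(f"Socket peak about {SOCKET_PEAK_GBS:.0f} GB/s; "
      f"about {cores_needed:.0f} cores at {PER_CORE_DEMAND_GBS} GB/s each to saturate it")
# With about 6 GB/s per core this lands near the roughly 24 cores cited in the text.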



Figure 7. Impact of active cores on memory read throughput (HPE DL360 Gen10 server, GPEC, 24 x 32 GB 2Rx4 2DPC DDR4-2666)

Memory throughput is considered saturated at the point where additional cores do not increase throughput—the curve flattens out. Notice that the Gold 5118, with 12 cores and DDR4-2400, saturates at a lower throughput level than the Platinum 8176, with 28 cores and 2666 MT/s capability; the throughput delta between the two is due entirely to the difference in data rates. The SKUs with fewer cores are not capable of saturating the 12 memory channels in the two-socket HPE DL360 Gen10 server. This data was collected on an HPE DL360 Gen10 server with the GPEC Workload Profile and 24 x 32 GB dual rank x4 2666 MT/s DIMMs installed. Note that while these measurements were executed with the 51xx, 61xx, and 81xx family processors, this data is illustrative of the relationships that will exist across the different CPUs supported in Gen10 servers.

Impact of Hyper-Threading with high- and low-core count CPUs

Intel supplies a wide variety of CPU SKUs that vary by clock speed, core count, memory data rate, and other capabilities. The external interfaces of the various SKUs are the same, but the internal architecture varies between them. When these internal differences are coupled with the various BIOS settings, the result is different performance characteristics depending on the workload. Figure 8 shows the impact of Intel Hyper-Threading (HT) technology and CPU SKU on memory throughput. The results show that HT reduces read throughput on low-core count CPUs but delivers throughput on par with HT disabled on medium- and high-core count CPU SKUs. Note that while these measurements were executed with the 51xx, 61xx, and 81xx family processors, this data is illustrative of the relationships that will exist across the different CPUs supported in Gen10 servers.



Figure 8. The impact of HT technology on memory throughput and idle latency (HPE DL360 Gen10 server, HPC Workload Profile, 32 GB 2Rx4 1DPC DDR4-2666; HT On vs. HT Off)

Figure 9 shows the single-threaded memory read performance of various CPU SKUs. The Gold 5118 has the lowest single-threaded throughput because it runs at 2400 MT/s instead of 2666 MT/s. The Gold 5122 has the highest single-threaded read throughput because it has the highest CPU clock speed and its internal architecture is optimized for single-threaded memory throughput. The Gold 6134 and the Platinum 8176 have vastly different CPU clock speeds but similar single-threaded throughput, a result of CPU architectural optimizations on the high-end SKU.

Figure 9. Single-threaded read throughput (HPE DL360 Gen10 server with two CPUs, GPEC, 24 x 32 GB 2Rx4 DDR4-2666 DIMMs)



Maximum DDR4 data rate

Some Intel Cascade Lake CPU SKUs support a maximum memory data rate of 2933 MT/s; others are limited to 2666 MT/s, 2400 MT/s, or 2133 MT/s. As discussed previously, there are RBSU options to lower the operating data rate down to 1866 MT/s. Lowering the operating data rate reduces memory throughput and DIMM power consumption. Figure 10 shows the maximum memory throughput for an HPE DL360 Gen10 server with two Platinum 8176 CPUs running at various data rates. The chart shows the throughput for a read-only workload (turquoise) and a 2:1 R:W workload (purple), along with the idle memory read latency (orange). Note the slight reduction in memory read latency at the highest data rate. This data was collected with the default GPEC profile enabled.

Figure 10. Impact of memory operating data rate on throughput and idle latency (HPE DL360 Gen10 server with Platinum 8176, GPEC, 24 x 32 GB 2Rx4 DDR4-2666)

Conclusion

Generation after generation, HPE continues to improve the performance, ease of use, and reliability of its HPE ProLiant and HPE Synergy servers. HPE Gen10 servers offer 81% more memory throughput than the previous generation. The Gen10 RBSU offers a single option to optimize numerous system configuration settings based on workloads such as HPC, virtualization, and mission critical, to name a few. And the HPE Fast Fault Tolerance feature increases system uptime while avoiding the performance penalty associated with this feature on previous generations.



Appendix A

The following table shows the impact of HT, workload, BIOS Workload Profile, and CPU SKU on memory throughput and idle latency. Note that while these measurements were executed with the 51xx, 61xx, and 81xx family processors, the same type of relationship will exist across all CPUs supported in Gen10 servers.

Table 2. Memory throughput and latency across multiple workloads and CPU models

Throughput is reported in GB/s and idle latency in ns, each as HT On / HT Off; idle latency is listed for the 1:0 R:W rows.

Gold 5122 3.6 GHz 4C DDR4-2666 (24 DIMMs, data rate 2400 MT/s)
  GPEC  1:0 R:W  throughput 122.5 / 98.4    idle latency 72.1 / 71.5
  GPEC  2:1 R:W  throughput 170.0 / 154.0
  LL    1:0 R:W  throughput 122.2 / 97.3    idle latency 72.0 / 71.8
  LL    2:1 R:W  throughput 169.6 / 154.0
  HPC   1:0 R:W  throughput 122.4 / 97.4    idle latency 72.0 / 71.6
  HPC   2:1 R:W  throughput 169.8 / 153.5

Gold 6134 3.2 GHz 8C DDR4-2666 (24 DIMMs, data rate 2666 MT/s)
  GPEC  1:0 R:W  throughput 205.8 / 178.0   idle latency 71.1 / 71.1
  GPEC  2:1 R:W  throughput 204.2 / 199.3
  LL    1:0 R:W  throughput 204.9 / 175.5   idle latency 72.1 / 72.5
  LL    2:1 R:W  throughput 204.2 / 198.9
  HPC   1:0 R:W  throughput 205.7 / 177.4   idle latency 71.3 / 71.7
  HPC   2:1 R:W  throughput 204.2 / 198.8

Gold 5118 2.3 GHz 12C DDR4-2400 (24 DIMMs, data rate 2400 MT/s)
  GPEC  1:0 R:W  throughput 157.7 / 159.8   idle latency 72.6 / 74.1
  GPEC  2:1 R:W  throughput 169.5 / 171.0
  LL    1:0 R:W  throughput 157.4 / 159.7   idle latency 75.9 / 75.4
  LL    2:1 R:W  throughput 169.5 / 171.1
  HPC   1:0 R:W  throughput 157.7 / 159.7   idle latency 74.9 / 74.5
  HPC   2:1 R:W  throughput 169.5 / 171.1

Platinum 8176 2.1 GHz 28C DDR4-2666 (24 DIMMs, data rate 2666 MT/s)
  GPEC  1:0 R:W  throughput 231.5 / 235.1   idle latency 69.1 / 69.5
  GPEC  2:1 R:W  throughput 213.1 / 214.6
  LL    1:0 R:W  throughput 231.8 / 235.6   idle latency 73.5 / 73.5
  LL    2:1 R:W  throughput 213.4 / 214.8
  HPC   1:0 R:W  throughput 231.7 / 235.3   idle latency 72.1 / 72.0
  HPC   2:1 R:W  throughput 213.0 / 214.7


Appendix B

The following table shows throughput, latency, and power consumption for the various DIMMs in the HPE Gen10 portfolio in 1DPC and 2DPC configurations. The data represents both Intel Xeon Scalable processor generations with their respective 2666 MT/s or 2933 MT/s memory.

Table 3. Memory latency, throughput, and power consumption across multiple workloads, DIMM configurations and CPU families

For each configuration, system throughput is given as 1:0 R:W / 2:1 R:W (GB/s), latency as idle / 1:0 R:W (ns), and per-DIMM power as idle / 1:0 R:W / 2:1 R:W (W).

815097-B21  HPE 8 GB 1Rx8 PC4-2666V-R Smart Kit (RDIMM)
  1DPC, 2666 MT/s, 12 DIMMs, 96 GB:    throughput 221.4 / 176.6   latency 59.6 / 168.2   power 0.1 / 3.0 / 3.1
  2DPC, 2666 MT/s, 24 DIMMs, 192 GB:   throughput 224.3 / 202.1   latency 59.7 / 167.7   power 0.1 / 2.3 / 2.4

815098-B21  HPE 16 GB 1Rx4 PC4-2666V-R Smart Kit (RDIMM)
  1DPC, 2666 MT/s, 12 DIMMs, 192 GB:   throughput 221.4 / 180.3   latency 60.5 / 168.7   power 0.2 / 4.5 / 4.8
  2DPC, 2666 MT/s, 24 DIMMs, 384 GB:   throughput 224.7 / 203.6   latency 58.8 / 168.6   power 0.2 / 3.3 / 3.7

835955-B21  HPE 16 GB 2Rx8 PC4-2666V-R Smart Kit (RDIMM)
  1DPC, 2666 MT/s, 12 DIMMs, 192 GB:   throughput 225.9 / 203.7   latency 61.5 / 166.7   power 0.2 / 3.8 / 4.2
  2DPC, 2666 MT/s, 24 DIMMs, 384 GB:   throughput 226.0 / 211.3   latency 60.7 / 173.4   power 0.2 / 2.8 / 3.1

Q2D31A  HPE SGI 16 GB 2Rx4 DDR4-2666 Memory Kit (RDIMM)
  1DPC, 2666 MT/s, 12 DIMMs, 192 GB:   throughput 228.1 / 208.1   latency 57.7 / 163.9   power 0.3 / 6.1 / 6.8
  2DPC, 2666 MT/s, 24 DIMMs, 384 GB:   throughput 227.4 / 213.3   latency 58.3 / 172.0   power 0.3 / 4.2 / 4.7

815100-B21  HPE 32 GB 2Rx4 PC4-2666V-R Smart Kit (RDIMM)
  1DPC, 2666 MT/s, 12 DIMMs, 384 GB:   throughput 225.2 / 204.8   latency 59.7 / 166.7   power 0.4 / 6.1 / 6.8
  2DPC, 2666 MT/s, 24 DIMMs, 768 GB:   throughput 225.5 / 211.4   latency 60.2 / 172.0   power 0.4 / 4.4 / 5.0

815101-B21  HPE 64 GB 4Rx4 PC4-2666V-L Smart Kit (LRDIMM)
  1DPC, 2666 MT/s, 12 DIMMs, 768 GB:   throughput 217.8 / 204.0   latency 61.1 / 187.3   power 1.0 / 10.7 / 11.8
  2DPC, 2666 MT/s, 24 DIMMs, 1536 GB:  throughput 214.4 / 199.4   latency 69.3 / 189.2   power 1.1 / 7.8 / 8.4

815102-B21  HPE 128 GB 8Rx4 PC4-2666V-L Smart Kit (LRDIMM)
  1DPC, 2666 MT/s, 12 DIMMs, 1536 GB:  throughput 187.0 / 174.2   latency 77.9 / 201.6   power 1.7 / 10.1 / 11.0
  2DPC, 2666 MT/s, 24 DIMMs, 3072 GB:  throughput 184.9 / 166.3   latency 88.9 / 218.3   power 2.0 / 7.8 / 8.3

P00918-B21  HPE 8 GB 1Rx8 PC4-2933Y-R Smart Kit (RDIMM)
  1DPC, 2933 MT/s, 12 DIMMs, 96 GB:    throughput 249.5 / 187.5   latency 54.2 / 189.3   power 0.1 / 3.0 / 3.3
  2DPC, 2933 MT/s, 24 DIMMs, 192 GB:   throughput 251.9 / 220.3   latency 54.2 / 181.7   power 0.1 / 2.3 / 2.6

P00920-B21  HPE 16 GB 1Rx4 PC4-2933Y-R Smart Kit (RDIMM)
  1DPC, 2933 MT/s, 12 DIMMs, 192 GB:   throughput 249.8 / 192.8   latency 54.0 / 187.4   power 0.2 / 4.5 / 4.9
  2DPC, 2933 MT/s, 24 DIMMs, 384 GB:   throughput 251.8 / 222.5   latency 53.6 / 181.8   power 0.2 / 2.7 / 3.0

P00922-B21  HPE 16 GB 2Rx8 PC4-2933Y-R Smart Kit (RDIMM)
  1DPC, 2933 MT/s, 12 DIMMs, 192 GB:   throughput 251.6 / 220.8   latency 55.4 / 182.1   power 0.2 / 3.8 / 4.4
  2DPC, 2933 MT/s, 24 DIMMs, 384 GB:   throughput 255.8 / 234.7   latency 55.0 / 189.8   power 0.2 / 2.8 / 3.2

P00924-B21  HPE 32 GB 2Rx4 PC4-2933Y-R Smart Kit (RDIMM)
  1DPC, 2933 MT/s, 12 DIMMs, 384 GB:   throughput 251.6 / 222.7   latency 54.4 / 182.5   power 0.3 / 5.8 / 6.7
  2DPC, 2933 MT/s, 24 DIMMs, 768 GB:   throughput 255.7 / 234.8   latency 55.4 / 189.3   power 0.4 / 4.2 / 4.8

P00930-B21  HPE 64 GB 2Rx4 PC4-2933Y-R Smart Kit (RDIMM)
  1DPC, 2933 MT/s, 12 DIMMs, 768 GB:   throughput 240.1 / 212.7   latency 62.8 / 188.1   power 0.7 / 6.1 / 6.5
  2DPC, 2933 MT/s, 24 DIMMs, 1536 GB:  throughput 242.4 / 228.2   latency 68.6 / 194.5   power 0.7 / 4.8 / 5.1

P00926-B21  HPE 64 GB 4Rx4 PC4-2933Y-L Smart Kit (LRDIMM)
  1DPC, 2933 MT/s, 12 DIMMs, 768 GB:   throughput 255.0 / 232.9   latency 54.6 / 189.4   power 1.0 / 10.3 / 11.0
  2DPC, 2933 MT/s, 24 DIMMs, 1536 GB:  throughput 248.8 / 226.9   latency 64.2 / 193.7   power 1.0 / 7.3 / 7.6

P00928-B21  HPE 128 GB 8Rx4 PC4-2933Y-L 3DS Smart Kit (LRDIMM)
  1DPC, 2933 MT/s, 12 DIMMs, 1536 GB:  throughput 215.4 / 192.1   latency 81.5 / 236.2   power 1.9 / 11.4 / 12.0
  2DPC, 2933 MT/s, 24 DIMMs, 3072 GB:  throughput 155.8 / 133.6   latency 99.3 / 246.8   power 2.0 / 8.7 / 9.1

P11040-B21  HPE 128 GB 4Rx4 PC4-2933Y-L Smart Kit (LRDIMM)
  1DPC, 2933 MT/s, 12 DIMMs, 1536 GB:  throughput 240.8 / 226.1   latency 63.6 / 195.7   power 1.7 / 10.0 / 11.0
  2DPC, 2933 MT/s, 24 DIMMs, 3072 GB:  throughput 222.1 / 211.4   latency 92.3 / 213.5   power 2.0 / 7.7 / 8.1


Resources

HPE server technical white paper library

HPE Server Memory

HPE Server Memory Configurator

HPE Server Memory whiteboard video

Learn more at hpe.com/info/memory


© Copyright 2019 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein.

Intel and Intel Xeon are trademarks of Intel Corporation in the U.S. and other countries. All other third-party marks are property of their respective owners.

a00063573ENW, April 2019, Rev. 1