Performance Brief for the HP DL980 (Database Server) and DL380 (ION Data Accelerator) 4.24.2013


Copyright Notice

The information contained in this document is subject to change without notice.

Fusion-io MAKES NO WARRANTY OF ANY KIND WITH REGARD TO THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. Except to correct same after receipt of reasonable notice, Fusion-io shall not be liable for errors contained herein or for incidental and/or consequential damages in connection with the furnishing, performance, or use of this material.

The information contained in this document is protected by copyright.

© 2013, Fusion-io, Inc. All rights reserved.

Fusion-io, the Fusion-io logo and ioDrive are registered trademarks of Fusion-io in the United States and other countries.

The names of other organizations and products referenced herein are the trademarks or service marks (as applicable) of their respective owners. Unless otherwise stated herein, no association with any other organization or product referenced herein is intended or should be inferred.

Fusion-io: 2855 E. Cottonwood Parkway, Box 100 Salt Lake City, UT 84121 USA

(801) 424-5500


CONTENTS

Introduction
    HARDWARE
        ION Data Accelerator System
        Initiator System
Storage Configuration
    INITIATOR HBA PLACEMENT
    ION DATA ACCELERATOR STORAGE POOL CONFIGURATION
    ION VOLUME CONFIGURATION
    ION LUN CONFIGURATION
    MULTIPATH VERIFICATION
Initiator BIOS Tuning
    UPDATING THE BIOS FOR NUMA DETECTION
    POWER MANAGEMENT OPTIONS
    SYSTEM OPTIONS
    ADVANCED OPTIONS
        Setting the Addressing Mode
        Disabling x2APIC
Initiator Tuning on Linux
    MULTIPATHING
    DISABLING PROCESSOR C-STATES IN LINUX
    IONTUNER RPM
        Block Device Tuning with udev Rules
        Disabling the cpuspeed Daemon
        Pinning interrupts
    VERIFYING THREAD PINNING
Oracle Tuning
    HUGEPAGES
    SYSCTL PARAMETERS
    ORACLE INITIALIZATION PARAMETERS
fio Performance Testing
    PRECONDITIONING FLASH STORAGE
    TESTING THREAD CPU AFFINITY
    TEST COMMANDS
    RESULTS
    SEQUENTIAL R/W THROUGHPUT AND IOPS
    RANDOM MIX R/W IOPS
    RANDOM MIX R/W THROUGHPUT
Oracle Performance Testing
    TEST SETUP
    TEST COMMANDS
    RESULTS
Oracle Database Testing
    READ WORKLOAD TEST – QUEST BENCHMARK FACTORY
    OLTP WORKLOAD TEST – HEAVY INSERT SCRIPT
    TRANSACTIONS TEST – SWINGBENCH
Conclusions
Glossary
Appendix A: Tuning Checklist
Appendix B: Speeding up Oracle Database Performance with ioMemory – an HP Session
    ARCHITECTURE OVERVIEW
    ABOUT ION DATA ACCELERATOR
        ION Data Accelerator Software
        Fusion-Powered Storage Stack
        Why ION Data Accelerator?
    ABOUT ION DATA ACCELERATOR HA (HIGH AVAILABILITY)
    PERFORMANCE TEST RESULTS: HP DL380 / HP DL980
    OVERVIEW OF THE ION DATA ACCELERATOR GUI
    COMPARATIVE SOLUTIONS
    BEST PRACTICES
    BENCHMARK TEST CONFIGURATION
    RAW PERFORMANCE TEST RESULTS WITH FIO
        Total IOPS
        Average Completion Latency (Microseconds)
        Raw I/O Test: 70% Read, 30% Write
        Raw I/O Test: 100% Read at 8KB
        Raw I/O Test: Read Latency (Microseconds)
    ORACLE WORKLOAD TESTS

Introduction

________________________________________________________________________

This document describes methods used to maximize performance for Oracle Database Server running on an HP DL980 and for ION Data Accelerator running on an HP DL380. These methods should provide a foundation for tuning methods with a variety of tests and customer applications.

The non-uniform memory access (NUMA) architecture of the DL980 presents challenges in minimizing data transfers between multiple processor nodes, while efficiently distributing I/O processing across available resources. Without any tuning, a configuration capable of as much as 700,000 IOPS may instead achieve no more than 160,000 IOPS. Likewise, a system capable of bandwidths of up to 7 GB/s may be limited to 3.5 GB/s. Testing performed with an un-tuned initiator may reflect poorly on ION Data Accelerator performance, when in reality the ION Data Accelerator software is not the problem.

The goals of this document are to

• Provide an example of what is possible with a specific configuration.

• Provide the tools necessary to improve performance on a variety of DL980 configurations, or with other initiator servers used with ION Data Accelerator.

Depending on the ioDrives and HBAs used, as well as fabric connectivity, you may need to vary the tuning described in this document. A script has been provided to perform the most complex tuning operations, but the steps performed by the script are fully described so you can adapt them for a variety of configurations.

These tuning methods were originally used to maximize performance at the HP European Performance Center in Böblingen. A similar configuration was recreated at Fusion-io in San Jose, and the performance results described in this document come from that testing. Though there were minor variations between the two configurations, similar performance was achieved.

For more details on the features and functionality of ION Data Accelerator, refer to the ION Data Accelerator User Guide.


HARDWARE

This section describes the hardware components used in the performance testing of the ION Data Accelerator appliance with its initiator.

ION Data Accelerator System

• DL380p Gen8 server

• 2 x Intel Xeon E5-2640 CPUs (6 cores each, 2.5 GHz)

• 64GB RAM

• 3 x 2.41TB ioDrive2 Duos

• 1 x QLogic 8Gbit Fibre Channel quad-port HBA

• 2 x QLogic 8Gbit Fibre Channel dual-port HBAs

• ION Data Accelerator 2.0.0 build 349 (VSL 3.2.3 build 950)

Initiator System

• HP DL980 Gen7 server

• 8 x Intel Xeon E7-4870 CPUs (10 cores each, 2.4 GHz)

• 256 GB RAM

• 3 x Emulex 8 Gbit Fibre Channel dual-port HBAs

• 1 x QLogic 8 Gbit Fibre Channel dual-port HBA

• Red Hat Enterprise Linux 6.3

• Oracle Database 11g Enterprise Edition 64-bit Release 11.2.0.3.0 with ASM


Storage Configuration

________________________________________________________________________

INITIATOR HBA PLACEMENT

The NUMA architecture of the DL980 must be considered when choosing where to place HBAs. PCIe slots 7, 8, 9, 10, and 11 are attached to the I/O hub nearest to CPU sockets 0 and 1. PCIe slots 1, 2, 3, 4, 5, and 6 are attached to the I/O hub nearest to CPU sockets 2 and 3. PCIe slots 12, 13, 14, 15, and 16 are attached to the I/O hub nearest to CPU sockets 4 and 5.

In the configurations used at HP Böblingen and Fusion-io San Jose, two HBAs were placed in slots 1 through 6, and two HBAs were placed in slots 7 through 11. In that configuration, I/O traffic is split between two I/O hubs. By using multiple I/O hubs, more CPU cores can access data from the HBAs at low cost, but there is a risk of transferring data between I/O hubs, which may cause poor performance. It is important to configure volume access such that no single volume is accessed from multiple I/O hubs. Note that even though a PCIe slot may be equidistant from two nodes, there is still less latency between cores within a node than between CPU cores on separate nodes attached to the same I/O hub.

Although the diagram above shows slots 12 through 16 attached to CPU sockets 6 and 7, other documentation from HP suggests that these slots are attached to nodes 4 and 5. If using the expansion slots, it is best to manually check the location of the PCIe slots.
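For scripting, the slot-to-hub mapping described above can be captured in a small helper. This is a sketch derived from the slot list in this section; the hub numbers it prints (1, 2, 3) are an illustrative convention rather than HP nomenclature.

```shell
# Map a DL980 PCIe slot number to the I/O hub group described above.
# Hub numbers 1-3 are an illustrative convention, not HP nomenclature,
# and slots 12-16 should be verified manually (see the caveat above).
slot_to_hub() {
  case "$1" in
    7|8|9|10|11)    echo 1 ;;  # I/O hub nearest CPU sockets 0 and 1
    1|2|3|4|5|6)    echo 2 ;;  # I/O hub nearest CPU sockets 2 and 3
    12|13|14|15|16) echo 3 ;;  # I/O hub nearest CPU sockets 4 and 5
    *) echo "unknown slot: $1" >&2; return 1 ;;
  esac
}
```

For example, `slot_to_hub 9` prints 1, matching the placement guidance above.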

You can use lspci to find the bus addresses of HBAs in the system:

# lspci | grep "Fibre Channel"
0b:00.0 Fibre Channel: ...
0b:00.1 Fibre Channel: ...
11:00.0 Fibre Channel: ...
11:00.1 Fibre Channel: ...
54:00.0 Fibre Channel: ...
54:00.1 Fibre Channel: ...
60:00.0 Fibre Channel: ...
60:00.1 Fibre Channel: ...

You can also use dmidecode to determine the PCI slot associated with each bus address:

# dmidecode -t slot
...
Handle 0x0908, DMI type 9, 17 bytes
System Slot Information
        Designation: PCI-E Slot 9
        Type: x8 PCI Express 2 x16
        Current Usage: In Use
        Length: Long
        ID: 9
        Characteristics:
                3.3 V is provided
                PME signal is supported
        Bus Address: 0000:0b:00.0
...
Handle 0x090A, DMI type 9, 17 bytes
System Slot Information
        Designation: PCI-E Slot11
        Type: x8 PCI Express 2 x16
        Current Usage: In Use
        Length: Long


        ID: 11
        Characteristics:
                3.3 V is provided
                PME signal is supported
        Bus Address: 0000:11:00.0
...
Handle 0x0901, DMI type 9, 17 bytes
System Slot Information
        Designation: PCI-E Slot 2
        Type: x8 PCI Express 2 x16
        Current Usage: In Use
        Length: Long
        ID: 2
        Characteristics:
                3.3 V is provided
                PME signal is supported
        Bus Address: 0000:54:00.0
...
Handle 0x0905, DMI type 9, 17 bytes
System Slot Information
        Designation: PCI-E Slot 6
        Type: x8 PCI Express 2 x16
        Current Usage: In Use
        Length: Long
        ID: 6
        Characteristics:
                3.3 V is provided
                PME signal is supported
        Bus Address: 0000:60:00.0

ION DATA ACCELERATOR STORAGE POOL CONFIGURATION

A RAID 0 set was created using all three ioDrive2 Duo cards present in the ION Data Accelerator system. This was done by using the following CLI command to create a storage profile for maximum performance:

admin@/> profile:create max_performance

ION VOLUME CONFIGURATION

Eight volumes of equal size were created from the storage pool, using the following CLI commands:

admin@/> volume:create volume0 841 pool_md0
admin@/> volume:create volume1 841 pool_md0


admin@/> volume:create volume2 841 pool_md0
admin@/> volume:create volume3 841 pool_md0
admin@/> volume:create volume4 841 pool_md0
admin@/> volume:create volume5 841 pool_md0
admin@/> volume:create volume6 841 pool_md0
admin@/> volume:create volume7 841 pool_md0

For ION Data Accelerator configurations with many ioDrives, it may be necessary to use 16 or more volumes to achieve maximum performance.
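For such larger configurations, the volume:create commands can be generated rather than typed by hand. The helper below is a hypothetical sketch: the per-volume size is an assumption the caller must supply based on pool capacity, and the volume0..volumeN-1 naming simply follows the example above.

```shell
# Emit ION CLI volume:create commands for N equal-size volumes.
# The size (in GB) and pool name are caller-supplied assumptions.
gen_volume_cmds() {
  n="$1"; size_gb="$2"; pool="$3"
  i=0
  while [ "$i" -lt "$n" ]; do
    echo "volume:create volume${i} ${size_gb} ${pool}"
    i=$((i + 1))
  done
}
# Example: gen_volume_cmds 16 420 pool_md0
```

The generated lines can be reviewed and then pasted into the admin CLI.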

ION LUN CONFIGURATION

To provide sufficient performance as well as redundancy, LUN access should be provided through multiple ION Data Accelerator targets and multiple initiator cards. Additionally, because of the NUMA architecture characteristics of the DL980, it may be best to localize access for each volume to a single I/O hub. Volumes should be exposed so that traffic is distributed evenly across all ports.

The diagram below shows the link configuration that was used at HP Böblingen.

Figure 1. Link configuration used at HP Böblingen

Four ports on the ION Data Accelerator system were connected, through a switch, to eight ports on the DL980 initiator. On the initiator, two dual-port cards were placed in each of I/O hub 1 and I/O hub 2. Exports were created on the four ports of the ION Data Accelerator to the four ports on each I/O hub of the initiator.

Each volume was exported on two links:


• Volume 0: t1 to i1, t4 to i4

• Volume 1: t2 to i2, t3 to i3

• Volume 2: t3 to i7, t2 to i6

• Volume 3: t1 to i5, t4 to i8

The same access pattern was repeated with every set of four subsequent volumes. Notice that access to each volume is localized to a single I/O hub on the initiator.

The diagram below shows the link configuration that was used at Fusion-io San Jose.

Figure 2. Link configuration used at Fusion-io San Jose

Because a switch was unavailable, eight ports on the ION Data Accelerator system were directly connected to eight ports on the initiator.

Each volume was exported on two links:

• Volume 0: t1 to i1, t6 to i4

• Volume 1: t3 to i5, t8 to i8

• Volume 2: t2 to i2, t5 to i3

• Volume 3: t4 to i6, t7 to i7

The same access pattern was repeated with every set of four subsequent volumes. Notice that access to each volume is once again localized to a single I/O hub on the initiator.

The following CLI commands were used to create initiator groups and LUNs on the ION Data Accelerator system at Fusion-io San Jose:


admin@/> inigroup:create i1 10:00:00:90:fa:14:a1:fc
admin@/> inigroup:create i2 10:00:00:90:fa:14:a1:fd
admin@/> inigroup:create i3 10:00:00:90:fa:14:f9:d4
admin@/> inigroup:create i4 10:00:00:90:fa:14:f9:d5
admin@/> inigroup:create i5 10:00:00:90:fa:1b:03:c8
admin@/> inigroup:create i6 10:00:00:90:fa:1b:03:c9
admin@/> inigroup:create i7 21:00:00:24:ff:46:bf:ca
admin@/> inigroup:create i8 21:00:00:24:ff:46:bf:cb

admin@/> lun:create -b 512 volume0 i1 21:00:00:24:ff:69:d3:4c
admin@/> lun:create -b 512 volume0 i6 21:00:00:24:ff:46:c0:b5
admin@/> lun:create -b 512 volume1 i3 21:00:00:24:ff:69:d3:4e
admin@/> lun:create -b 512 volume1 i8 21:00:00:24:ff:45:f4:ad
admin@/> lun:create -b 512 volume2 i2 21:00:00:24:ff:69:d3:4d
admin@/> lun:create -b 512 volume2 i5 21:00:00:24:ff:46:c0:b4
admin@/> lun:create -b 512 volume3 i4 21:00:00:24:ff:69:d3:4f
admin@/> lun:create -b 512 volume3 i7 21:00:00:24:ff:45:f4:ac
admin@/> lun:create -b 512 volume4 i1 21:00:00:24:ff:69:d3:4c
admin@/> lun:create -b 512 volume4 i6 21:00:00:24:ff:46:c0:b5
admin@/> lun:create -b 512 volume5 i3 21:00:00:24:ff:69:d3:4e
admin@/> lun:create -b 512 volume5 i8 21:00:00:24:ff:45:f4:ad
admin@/> lun:create -b 512 volume6 i2 21:00:00:24:ff:69:d3:4d
admin@/> lun:create -b 512 volume6 i5 21:00:00:24:ff:46:c0:b4
admin@/> lun:create -b 512 volume7 i4 21:00:00:24:ff:69:d3:4f
admin@/> lun:create -b 512 volume7 i7 21:00:00:24:ff:45:f4:ac

MULTIPATH VERIFICATION

When the steps above have been completed and dm-multipath has been started on the initiator, the multipath command may be used to verify the configuration.

# multipath -ll
mpathhes (23337613362643333) dm-2 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 1:0:0:0 sdd 8:48 active ready running
  `- 2:0:0:0 sdf 8:80 active ready running
mpathhez (23330633436333064) dm-7 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 4:0:0:1 sdk 8:160 active ready running
  `- 7:0:0:1 sdq 65:0 active ready running


mpathhey (23437373930653063) dm-4 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 0:0:0:1 sdc 8:32 active ready running
  `- 3:0:0:1 sdi 8:128 active ready running
mpathhex (26433343437616137) dm-8 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 5:0:0:1 sdm 8:192 active ready running
  `- 6:0:0:1 sdo 8:224 active ready running
mpathhew (23061313364323662) dm-5 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 1:0:0:1 sde 8:64 active ready running
  `- 2:0:0:1 sdg 8:96 active ready running
mpathhev (26432353466383337) dm-6 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 4:0:0:0 sdj 8:144 active ready running
  `- 7:0:0:0 sdp 8:240 active ready running
mpathheu (23637366232363564) dm-3 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 0:0:0:0 sdb 8:16 active ready running
  `- 3:0:0:0 sdh 8:112 active ready running
mpathhet (23632393433663839) dm-9 FUSIONIO,ION LUN
size=783G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 5:0:0:0 sdl 8:176 active ready running
  `- 6:0:0:0 sdn 8:208 active ready running

Notice that there are eight multipath devices, each comprising two LUNs. Each path has a number associated with it, of the form <host>:0:0:<lun#>. The host numbers correspond to specific PCI device ports. A PCI device address can be correlated to a host number by looking in sysfs:

# ls -d /sys/bus/pci/devices/*/host*
/sys/bus/pci/devices/0000:11:00.0/host0


/sys/bus/pci/devices/0000:11:00.1/host1
/sys/bus/pci/devices/0000:0b:00.0/host2
/sys/bus/pci/devices/0000:0b:00.1/host3
/sys/bus/pci/devices/0000:54:00.0/host4
/sys/bus/pci/devices/0000:54:00.1/host5
/sys/bus/pci/devices/0000:60:00.0/host6
/sys/bus/pci/devices/0000:60:00.1/host7

For example, multipath device mpathhet has paths through hosts 5 and 6 (shown by the numbers 5:0:0:0 and 6:0:0:0), which correspond to devices 0000:54:00.1 and 0000:60:00.0. The output from the dmidecode command used in the Initiator HBA Placement section shows that this volume is exposed through HBAs in slots 2 and 6, which are both in the same I/O hub. It is important that each volume presented in multipath is accessed only through HBAs in the same I/O hub.
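The correlation from SCSI host number to PCI address can also be scripted. The sketch below only walks the sysfs layout shown above; the SYSROOT override is not part of any standard tool and exists purely for illustration and testing.

```shell
# Print the PCI device address that owns a given SCSI host number,
# by reading the /sys/bus/pci/devices/<addr>/host<N> layout shown above.
# SYSROOT defaults to "" (the real filesystem root); it is overridable
# only so the helper can be exercised against a mock tree.
host_to_pci() {
  hostnum="$1"
  for d in "${SYSROOT:-}"/sys/bus/pci/devices/*/host"${hostnum}"; do
    [ -e "$d" ] || continue
    basename "$(dirname "$d")"
    return 0
  done
  echo "no PCI device found for host${hostnum}" >&2
  return 1
}
```

On the test system described above, `host_to_pci 5` would print 0000:54:00.1, which dmidecode then maps to its PCIe slot.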


Initiator BIOS Tuning

________________________________________________________________________

The following settings should be applied on the HP DL980 initiator, using the ROM-Based Setup Utility (RBSU) on boot.

To enter the RBSU, press F9 during boot (when the F9 Setup option appears on the screen).


UPDATING THE BIOS FOR NUMA DETECTION

In the DL980 BIOS version dated 05/01/2012, a change was made to the SLIT node distances. This may affect performance, so it is recommended that the latest version of the BIOS be used. Incorrect SLIT node distances are a common issue with early BIOS revisions on many platforms.

The BIOS version can be determined from the main BIOS screen. Alternatively, numactl can be used to verify that the node distances match the table below:

# numactl --hardware
...
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  12  17  17  19  19  19  19
  1:  12  10  17  17  19  19  19  19
  2:  17  17  10  12  19  19  19  19
  3:  17  17  12  10  19  19  19  19
  4:  19  19  19  19  10  12  17  17
  5:  19  19  19  19  12  10  17  17
  6:  19  19  19  19  17  17  10  12
  7:  19  19  19  19  17  17  12  10

POWER MANAGEMENT OPTIONS

To enable maximum performance, disable the HP power management options.

1. Select Power Management Options > HP Power Profile > Maximum Performance.


2. Verify that C-states have been disabled by selecting Power Management Options > Advanced Power Management Options > Minimum Processor Idle Power Core State.

“No C-states” should be highlighted in the menu.

C-states may also need to be disabled in Linux, as explained later in this document.

SYSTEM OPTIONS

Intel Hyperthreading may or may not be beneficial to ION Data Accelerator performance. In this test setup, Hyperthreading was enabled. Other system options were set as described below.

1. Enable hyperthreading by selecting System Options > Processor Options > Intel Hyperthreading Options > Enabled.


2. Disable Virtualization if it is not required, by selecting System Options > Processor Options > Intel Virtualization Technology > Disabled.


3. Disable VT-d (Virtualization Technology for Directed I/O) by selecting System Options > Processor Options > Intel VT-d > Disabled.

ADVANCED OPTIONS

Setting the Addressing Mode

The preferred addressing mode depends on the operating system and the amount of memory used.

For all RHEL 5.x installations, use 40-bit addressing. For RHEL 6.x installations, use 40-bit addressing when 1 TB or less of memory is present; otherwise, 44-bit addressing must be used to take advantage of all available memory. To disable 44-bit addressing, select Advanced Options > Advanced System ROM Options > Address Mode 44-bit > Disabled.


For RHEL 6.x installations using greater than 1 TB of memory, use 44-bit addressing: Advanced Options > Advanced System ROM Options > Address Mode 44-bit > Enabled.

At HP Böblingen, the DL980 contained 1TB of memory, so 40-bit addressing was sufficient.
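The addressing-mode rules above reduce to a small decision helper. This is a sketch under the assumptions that the RHEL major version is given as an integer and the installed memory as whole gigabytes; the function name is illustrative.

```shell
# Recommend a DL980 addressing mode from the rules above:
# RHEL 5.x -> always 40-bit; RHEL 6.x -> 40-bit up to 1 TB, else 44-bit.
addr_mode() {
  rhel_major="$1"; mem_gb="$2"
  if [ "$rhel_major" -ge 6 ] && [ "$mem_gb" -gt 1024 ]; then
    echo "44-bit"
  else
    echo "40-bit"
  fi
}
# Example: addr_mode 6 2048 prints "44-bit"
```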

Disabling x2APIC

To verify that x2APIC is disabled, select Advanced Options > Advanced System ROM Options > x2APIC Options. The “Disabled” option should be highlighted; select it if it is not.


Initiator Tuning on Linux

________________________________________________________________________

The following settings should be configured in Linux. In some cases, a reboot is required for the changes to take effect.

MULTIPATHING

Typically, the preferred queuing technique is to send I/O to the path with the fewest I/Os currently queued. The following example shows how the multipath.conf file can be configured, using a path_selector of "queue-length 0":

device {
        vendor "FUSIONIO"
        product "*"
        path_selector "queue-length 0"
        rr_min_io_rq 1
        rr_weight uniform
        no_path_retry 20
        failback 60
        path_grouping_policy multibus
        path_checker tur
}

Another approach that may provide better results is setting path_selector to "round-robin 0". The round-robin selector uses fewer CPU cycles, but it does not correct for unbalanced performance characteristics across multiple paths, or for additional load from other devices that may be slowing down one of the paths.

DISABLING PROCESSOR C-STATES IN LINUX

For newer Linux kernels (2.6.32 or later), disabling CPU idle power states can boost performance.


However, these must be disabled at boot time rather than in the BIOS.

To disable CPU states, add the intel_idle.max_cstate=0 and processor.max_cstate=0 boot parameters to the kernel line in /boot/grub/grub.conf, as follows:

title Red Hat Enterprise Linux (2.6.32-279.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-279.el6.x86_64 ro root=/dev/mapper/vg_rhel980-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_LVM_LV=vg_rhel980/lv_root rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=128M rd_LVM_LV=vg_rhel980/lv_swap KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet intel_idle.max_cstate=0 processor.max_cstate=0
        initrd /initramfs-2.6.32-279.el6.x86_64.img

One way to verify that the CPU states have been disabled entirely is to check that the cpuidle sysfs files no longer exist:

# ls /sys/devices/system/cpu/cpu0/cpuidle
ls: cannot access /sys/devices/system/cpu/cpu0/cpuidle: No such file or directory

IONTUNER RPM

The tuning suggestions in this section can be performed in one step by installing the iontuner RPM. The RPM is made available on the Fusion-io internal network:

https://confluence.int.fusionio.com/display/ION/Documentation#Documentation-IONPerformanceBrief,HPDL980(INTERNAL-ONLY)

The RPM can be installed with the following command (the RPM version may be different):

# rpm -Uvh iontuner-0.0.2-1.noarch.rpm

If ION LUNs have already been detected by the initiator, a reboot or a reload of the device drivers may be necessary after the RPM install. This serves to complete the tuning that is performed upon device discovery. If in doubt about LUN discovery, reboot.

The tuning described in the following sub-sections is done by the iontuner RPM, and it does not need to be performed manually if the RPM has been installed. Detailed steps are provided here in order to completely describe the RPM function and to assist those who may need to adjust the steps for unsupported platforms.


Block Device Tuning with udev Rules

Note: The tuning in this section is performed by the iontuner RPM.

To improve I/O performance, you should tune the I/O scheduling queues on all devices in the data path. This includes both the individual SCSI devices (/dev/sd*) and the device-mapper devices (/dev/dm-*).

Three settings changes have been proven to provide a performance benefit under some workloads:

1) Always use the noop I/O scheduler with ION Data Accelerator devices:

# echo noop > /sys/block/<device>/queue/scheduler

2) Use strict block-request affinity. This forces the handling of I/O completion to occur on the same CPU where the request was initiated.

# echo 2 > /sys/block/<device>/queue/rq_affinity

Strict block-request affinity is not available on RHEL 5, and on some kernels group affinity will be used where strict affinity is not supported. After setting the file to '2', a read of the file will return '1' if only CPU group affinity is available.

3) To get more consistent performance results, disable entropy pool contribution:

# echo 0 > /sys/block/<device>/queue/add_random

The methods described above must be run after multipath devices are configured and detected by the initiator, and the settings will not persist through a reboot. To make them persistent, Linux provides the udev rules mechanism, which allows sysfs parameters to be set upon device discovery, both at boot time and at run time.

The iontuner RPM installs the following rules in /etc/udev/rules.d/99-iontuner.rules:

ACTION=="add|change", SUBSYSTEM=="block", ATTR{device/vendor}=="FUSIONIO", ATTR{queue/scheduler}="noop", ATTR{queue/rq_affinity}="2", ATTR{queue/add_random}="0"

ACTION=="add|change", KERNEL=="dm-*", PROGRAM="/bin/bash -c 'cat /sys/block/$name/slaves/*/device/vendor | grep FUSIONIO'", ATTR{queue/scheduler}="noop", ATTR{queue/rq_affinity}="2", ATTR{queue/add_random}="0"

The first rule applies scheduler, rq_affinity, and add_random changes to all SCSI block devices (/dev/sd*) whose vendor is FUSIONIO.

The second rule applies scheduler, rq_affinity, and add_random changes to all DM multipath devices (/dev/dm-*) that are created on top of block devices whose vendor is FUSIONIO.
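The vendor test performed by the second rule can be replicated outside of udev for verification. The bash sketch below is a hypothetical helper, not part of the iontuner RPM; it applies the same check (a dm device qualifies if any of its slaves reports vendor FUSIONIO), with the sysfs root passed as a parameter so the logic can be exercised against a mock directory tree:

```shell
#!/bin/bash
# Hypothetical helper mirroring the udev PROGRAM check: succeed if any slave
# of the given dm device has a SCSI vendor of FUSIONIO.
dm_is_fusionio() {
    local sysfs="$1" dm="$2"
    cat "$sysfs/block/$dm"/slaves/*/device/vendor 2>/dev/null | grep -q FUSIONIO
}
# Example: dm_is_fusionio /sys dm-3 && echo "ION multipath device"
```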


Disabling the cpuspeed Daemon

Note: The tuning in this section is performed by the iontuner RPM.

Disabling the cpuspeed daemon on Linux can boost overall performance. To disable the cpuspeed daemon immediately, run this command:

# service cpuspeed stop

To prevent the cpuspeed daemon from running after a reboot, run this command:

# chkconfig cpuspeed off

Pinning Interrupts

Note: The tuning in this section is performed by the iontuner RPM.

To minimize data transfer and synchronization throughout the system, I/O interrupts should be handled on a socket close to the HBA’s I/O hub.

When manually configuring IRQs, the irqbalance daemon must first be disabled. To disable the irqbalance daemon immediately, run this command:

# service irqbalance stop

To prevent the irqbalance daemon from running after a reboot, run this command:

# chkconfig irqbalance off

IRQs should be pinned for each driver that handles interrupts for ION device I/O. Typically, this is just the HBA driver. Driver IRQs can be identified in /proc/interrupts by matching IRQ numbers to the driver prefix listed in the same row. The following table shows some common drivers and the prefixes used to identify their IRQs:

Driver         Prefix
QLogic FC      qla
Brocade FC     bfa
Emulex FC      lpfc
Emulex iSCSI   beiscsi, eth
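The lookup can be scripted. The following bash sketch is an illustrative helper (not provided by the iontuner RPM) that prints the IRQ numbers whose /proc/interrupts row contains a given driver prefix; it reads the table from stdin so it can be fed either /proc/interrupts itself or sample text:

```shell
#!/bin/bash
# Hypothetical helper: print IRQ numbers for rows of a /proc/interrupts-style
# table that mention the given driver prefix (e.g. qla, lpfc).
driver_irqs() {
    local prefix="$1"
    awk -v p="$prefix" '$1 ~ /^[0-9]+:$/ && index($0, p) { sub(":","",$1); print $1 }'
}
# Example: driver_irqs qla < /proc/interrupts
```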

The iontuner RPM installs the iontuner service init script, which runs at boot time to distribute IRQs across the CPU cores local to each HBA's I/O hub. Below is an example of the commands issued at startup:

echo 00000000,00000000,00000000,00000000,00000001 > /proc/irq/114/smp_affinity

echo 00000000,00000000,00000000,00000000,00000002 > /proc/irq/115/smp_affinity

echo 00000000,00000000,00000000,00000000,00000004 > /proc/irq/116/smp_affinity

echo 00000000,00000000,00000000,00000000,00000008 > /proc/irq/117/smp_affinity

echo 00000000,00000000,00000000,00000000,00000010 > /proc/irq/118/smp_affinity

echo 00000000,00000000,00000000,00000000,00000020 > /proc/irq/119/smp_affinity

echo 00000000,00000000,00000000,00000000,00000040 > /proc/irq/120/smp_affinity

echo 00000000,00000000,00000000,00000000,00000080 > /proc/irq/121/smp_affinity

echo 00000000,00000000,00000000,00000000,00100000 > /proc/irq/134/smp_affinity

echo 00000000,00000000,00000000,00000000,00200000 > /proc/irq/135/smp_affinity

echo 00000000,00000000,00000000,00000000,00400000 > /proc/irq/136/smp_affinity

echo 00000000,00000000,00000000,00000000,00800000 > /proc/irq/137/smp_affinity

echo 00000000,00000000,00000000,00000000,01000000 > /proc/irq/122/smp_affinity

echo 00000000,00000000,00000000,00000000,02000000 > /proc/irq/123/smp_affinity

echo 00000000,00000000,00000000,00000000,04000000 > /proc/irq/124/smp_affinity

echo 00000000,00000000,00000000,00000000,08000000 > /proc/irq/125/smp_affinity

Affinity is set by writing to the /proc/irq/<irq#>/smp_affinity file for a given IRQ. Each IRQ is assigned affinity to a different CPU core on a node nearest to the IRQ’s PCIe device. In smp_affinity files, each core is represented by a single bit, starting with the least significant bit mapping to CPU 0. The IRQs associated with each device driver can be found by reading the /proc/interrupts file.
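The bit layout described above can be expressed as a small helper. The following bash sketch (illustrative only, not part of the iontuner service) builds the mask string for a single CPU core, with the number of 32-bit words as a parameter (5 words covers the 160 logical CPUs of this system):

```shell
#!/bin/bash
# Hypothetical helper: build an smp_affinity mask string for one CPU core in
# the comma-separated 32-bit-word format used by /proc/irq/<irq#>/smp_affinity.
cpu_to_smp_affinity() {
    local cpu="$1" nwords="$2"
    local word=$((cpu / 32)) pos=$((cpu % 32))
    local out="" w
    for ((w = nwords - 1; w >= 0; w--)); do
        if ((w == word)); then
            out+=$(printf '%08x' $((1 << pos)))
        else
            out+="00000000"
        fi
        ((w > 0)) && out+=","
    done
    echo "$out"
}
# Example: cpu_to_smp_affinity 20 5 prints the mask used for IRQ 134 above:
# 00000000,00000000,00000000,00000000,00100000
```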

There are ten CPU cores per node. In the example above, eight interrupts (the first eight entries) for the devices in slots 9 and 11 are mapped to node 0, and eight interrupts (the last eight entries) for the devices in slots 2 and 6 are mapped to node 2. On the DL980, each PCIe slot can be efficiently assigned to either of the nodes corresponding to its I/O hub. However, it is important that all processes related to that device be assigned to the same node.

Because these settings will not persist through a reboot, the iontuner service runs each time the system is booted.

VERIFYING THREAD PINNING

Note: The tuning in this section was not necessary in the DL980/RHEL 6.3 testing. It is included because it is unknown at this time whether it may be necessary on other platforms.

To further minimize data transfer and synchronization times throughout the system, it may be beneficial to place critical I/O driver threads on the same socket as the interrupts and HBA. This may only be necessary with some drivers. For instance, this is helpful with QLogic drivers but is not necessary when using Emulex drivers because no critical work is performed in Emulex driver threads. In the case of the DL980 running RHEL 6.3, the QLogic driver threads always ran on cores local to the HBAs, even though they were not pinned.


To check where QLogic driver threads are executing, run the following command:

# ps -eo comm,psr | grep qla
qla2xxx_6_dpc 20
qla2xxx_7_dpc 20

The number beside each process indicates the core it is currently executing on.

The numbers “6” and “7” in the above example correspond to specific PCI device host numbers. You can correlate a PCI device to a host number by looking in sysfs:

# ls -d /sys/bus/pci/devices/*/host*
/sys/bus/pci/devices/0000:11:00.0/host0
/sys/bus/pci/devices/0000:11:00.1/host1
/sys/bus/pci/devices/0000:0b:00.0/host2
/sys/bus/pci/devices/0000:0b:00.1/host3
/sys/bus/pci/devices/0000:54:00.0/host4
/sys/bus/pci/devices/0000:54:00.1/host5
/sys/bus/pci/devices/0000:60:00.0/host6
/sys/bus/pci/devices/0000:60:00.1/host7

The CPUs local to each PCI device can also be found in sysfs:

# cat /sys/bus/pci/devices/0000:54:00.0/local_cpulist
20-29,100-109

If the device thread is not executing on one of the listed cores, run the following command:

# /usr/sbin/iontuner.py --pinqladriver

The output from the script shows the commands it issued:

taskset -pc 20-29,100-109 947
taskset -pc 20-29,100-109 942

The script assigns CPU affinity for each discovered PID through the taskset command, using the following parameters:

# taskset -pc <CPU list> <PID>

PIDs can be discovered through the ps command, but each driver has its own naming convention for these processes. For example, the following command will show QLogic driver threads:

# ps -eo comm,pid | grep qla
qla2xxx_6_dpc 942
qla2xxx_7_dpc 947

The driver thread should be pinned to the set of cores listed in the device local_cpulist.


On the DL980, although every I/O hub is local to two NUMA nodes, only the CPU cores from the lower-numbered node are shown as local to each PCI device. In this example, the first range (20-29) corresponds to the CPU cores in NUMA node 2, and the second range (100-109) corresponds to the hyper-threading cores for NUMA node 2. The second range will only be present if hyper-threading is enabled. Though the device is also local to NUMA node 3, it is generally sufficient to pin all devices to one of the two NUMA nodes, provided there are enough CPU resources on a single node. Splitting pinning between the two nodes requires extreme precision: pinning resources from one device on two separate nodes can result in poor performance, because although both nodes may be local to the device, they are not local to each other.

These settings will not persist through a reboot.


Oracle Tuning

________________________________________________________________________

The following settings are specific to tuning for Oracle. A reboot is required for the system settings below to take effect.

HUGEPAGES

Configuring HugePages reduces the overhead of utilizing large amounts of memory by reducing the page table size of the Oracle System Global Area (SGA). The default HugePage size is 2 MB, compared with the typical page size of 4 KB. With a page size of 2 MB, a 10 GB SGA will have only 5120 pages, compared to 2.6 million pages without HugePages.
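The page counts above follow directly from the page sizes. The following bash sketch is illustrative only (the actual vm.nr_hugepages value should follow the Oracle recommendation for the deployed SGA); it rounds up the number of pages needed to back a given allocation:

```shell
#!/bin/bash
# Hypothetical helper: pages needed to back an allocation, rounding up.
# Both sizes are in bytes.
pages_needed() {
    local alloc="$1" page="$2"
    echo $(( (alloc + page - 1) / page ))
}
# 10 GB SGA with 2 MB HugePages vs. the default 4 KB pages:
pages_needed $((10 * 1024**3)) $((2 * 1024**2))   # 5120
pages_needed $((10 * 1024**3)) $((4 * 1024))      # 2621440 (~2.6 million)
```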

HugePages can be configured in /etc/sysctl.conf:

vm.nr_hugepages=55612
vm.hugetlb_shm_group=501

The number of HugePages used here is based on a recommendation from Oracle. The group should be set to the group ID of the oracle user, which can be determined with the id command:

# id -g oracle
501

After a reboot, the number of available HugePages can be verified.

# cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 55612

SYSCTL PARAMETERS

The following parameters were configured for Oracle in /etc/sysctl.conf:

kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
net.core.rmem_default = 4194304
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.ipv4.ip_local_port_range = 9000 65500
fs.file-max = 6815744
net.core.wmem_max = 1048576
fs.aio-max-nr = 1048576

ORACLE INITIALIZATION PARAMETERS

The following parameters were set in the /opt/oracle/product/11.2.0/dbs/initorcl.ora file:

*.db_block_size=8192
*.db_recovery_file_dest_size=2000G
*.processes=6000
*.db_writer_processes=16
*.dml_locks=80000
*.filesystemio_options='SETALL'
*.open_cursors=8192
*.optimizer_capture_sql_plan_baselines=FALSE
*.parallel_degree_policy='AUTO'
*.parallel_threads_per_cpu=2
*.pga_aggregate_target=8G
*.sga_max_size=50G
*.sga_target=50G
*.use_large_pages='only'
_enable_NUMA_support=TRUE

The _enable_NUMA_support parameter enables Oracle NUMA optimizations.

The use_large_pages parameter ensures that each NUMA segment will be backed by HugePages.


fio Performance Testing

________________________________________________________________________

After performing the configuration described in this document, the fio tool can be used to verify the synthetic performance of the ION Data Accelerator configuration.

PRECONDITIONING FLASH STORAGE

Running tests immediately after a low-level format of the flash storage is not a meaningful test for the ION Data Accelerator system or any other flash-based storage system.

It is always recommended that preconditioning be performed prior to measuring performance. When comparing multiple flash storage solutions, it is necessary to perform the same preconditioning on each system. Improper preconditioning can lead to extremely unrealistic performance comparisons.

Preconditioning can be performed by writing a random data pattern to the entire address range of the device, using a consistent block size. A block size of 1MB is recommended.
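As a sketch, preconditioning of this kind could be expressed as a fio job file similar to the following. This job is illustrative only and is not generated by the iontuner RPM; the device path is a placeholder for an actual ION multipath device:

```ini
# Hypothetical preconditioning job: sequential 1 MB direct writes across the
# device's full address range.
[precondition]
filename=/dev/dm-3
rw=write
bs=1m
direct=1
ioengine=libaio
iodepth=32
```

Run one such job per device, and repeat the same preconditioning on every system under comparison.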

TESTING THREAD CPU AFFINITY

Earlier, this document described how to align all I/O to a given LUN on a single socket. This was done by HBA placement, restricted LUN access, target-initiator connections, IRQ affinity, and driver thread affinity. The final component is to force the test threads accessing that LUN onto the same NUMA node as all of the other components. Configuring this will vary depending on the test used. For the fio test, the cpus_allowed parameter can be used as shown in the examples below.

TEST COMMANDS

The iontuner RPM provides a script that may be used to generate fio job files with optimal NUMA tuning parameters. The RPM is made available on the Fusion-io internal network in the same location as this document:


https://confluence.int.fusionio.com/display/ION/Documentation#Documentation-IONPerformanceBrief,HPDL980(INTERNAL-ONLY)

A fio job file can be created using the following command format:

# /usr/sbin/iontuner.py --setupfio='<parameters>'

The script generates a job file using fio parameters that have been shown to provide optimal performance results. They also provide efficient pinning for all test threads. In addition to the built-in parameters, options specified in the <parameters> field as a comma-separated list are also added to the job file. This option should be used to specify read/write balance, random vs. sequential I/O, test length, and any other parameters specific to the workload being tested.

For example, the following command can be used to generate a random 4KB read test:

# /usr/sbin/iontuner.py --setupfio='rw=randrw,bs=4k,rwmixread=100,runtime=600,loops=10000,numjobs=1'

This command generates the following job file in /root/iontuner-fio.ini:

[global]
rw=randrw
bs=4k
rwmixread=100
runtime=600
loops=10000
numjobs=1
iodepth=256
group_reporting=1
thread=1
exitall=1
sync=0
direct=1
randrepeat=0
norandommap=1
ioengine=libaio
gtod_reduce=1
iodepth_batch=64
iodepth_batch_complete=64
iodepth_batch_submit=64

[dm-10]
filename=/dev/dm-10
offset=0
size=8409579520
cpus_allowed=20,21,22,23,24,25,26,27,28,29,100,101,102,103,104,105,106,107,108,109

[dm-8]
filename=/dev/dm-8
offset=0
size=8409579520
cpus_allowed=20,21,22,23,24,25,26,27,28,29,100,101,102,103,104,105,106,107,108,109


[dm-9]
filename=/dev/dm-9
offset=0
size=8409579520
cpus_allowed=0,1,2,3,4,5,6,7,8,9,80,81,82,83,84,85,86,87,88,89

[dm-6]
filename=/dev/dm-6
offset=0
size=8409579520
cpus_allowed=20,21,22,23,24,25,26,27,28,29,100,101,102,103,104,105,106,107,108,109

[dm-7]
filename=/dev/dm-7
offset=0
size=8409579520
cpus_allowed=0,1,2,3,4,5,6,7,8,9,80,81,82,83,84,85,86,87,88,89

[dm-4]
filename=/dev/dm-4
offset=0
size=8409579520
cpus_allowed=20,21,22,23,24,25,26,27,28,29,100,101,102,103,104,105,106,107,108,109

[dm-5]
filename=/dev/dm-5
offset=0
size=8409579520
cpus_allowed=0,1,2,3,4,5,6,7,8,9,80,81,82,83,84,85,86,87,88,89

[dm-3]
filename=/dev/dm-3
offset=0
size=8409579520
cpus_allowed=0,1,2,3,4,5,6,7,8,9,80,81,82,83,84,85,86,87,88,89

The numjobs parameter must be tuned specifically for each configuration. Though one job per volume was optimal in this configuration, for ION Data Accelerator configurations with many ioDrives it may be necessary to use four or more jobs per volume to achieve maximum performance.

The cpus_allowed parameter is used to specify a list of CPUs on which each test thread may run. Earlier sections of this document described how to align all I/O to a given volume on a single socket by HBA placement, restricted LUN access, target-initiator connections, IRQ affinity, and driver thread affinity. This final component forces the test threads accessing that volume onto the same NUMA node as all of the other components.

To manually determine which CPUs a multipath device should be pinned to, first the host number must be obtained from the multipath command:

# multipath -l
mpathgzu (26364646430613766) dm-3 FUSIONIO,ION LUN
size=174G features='3 queue_if_no_path pg_init_retries 50' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=0 status=active
  |- 2:0:0:0 sdm 8:192 active undef running
  `- 1:0:0:0 sdg 8:96  active undef running
...

The first number listed with each underlying sd* device indicates the host number. The host number can be correlated to a PCI device by looking in sysfs:

# ls -d /sys/bus/pci/devices/*/host*
/sys/bus/pci/devices/0000:11:00.1/host1
/sys/bus/pci/devices/0000:0b:00.0/host2
...

The CPUs local to each PCI device can also be found in sysfs:

# cat /sys/bus/pci/devices/0000:11:00.1/local_cpulist
0-9,80-89
# cat /sys/bus/pci/devices/0000:0b:00.0/local_cpulist
0-9,80-89

If the devices are pathed properly, the local CPU list for each underlying device should be identical. These CPUs should be listed in the cpus_allowed parameter of fio.
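When building a job file by hand, the local_cpulist ranges must be expanded into the explicit comma-separated form used by cpus_allowed. The following bash sketch is an illustrative helper (iontuner.py performs this internally; this function is not part of the RPM):

```shell
#!/bin/bash
# Hypothetical helper: expand a local_cpulist string such as "20-29,100-109"
# into the explicit CPU list expected by the fio cpus_allowed parameter.
expand_cpulist() {
    local list="$1" part out=()
    IFS=',' read -ra parts <<< "$list"
    for part in "${parts[@]}"; do
        if [[ "$part" == *-* ]]; then
            # Range entry: expand with seq.
            out+=($(seq "${part%-*}" "${part#*-}"))
        else
            out+=("$part")
        fi
    done
    (IFS=','; echo "${out[*]}")
}
# Example: expand_cpulist "$(cat /sys/bus/pci/devices/0000:0b:00.0/local_cpulist)"
```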

Information on the other fio parameters used here is available in the fio man page.

In addition to creating a job file, the script will output the command that can be used to run a fio test with the job file. To run the test, copy the output of the script onto the command line:

# fio ./iontuner-fio.ini

The fio test will execute and generate test results to the terminal.

RESULTS

The following fio test results are captured in this section, all on the HP DL980 initiator:

• Sequential R/W throughput and IOPS

• Random mix R/W IOPS

• Random mix R/W throughput

All tests were performed with the following elements:

• 3 x 2.41TB ioDrive2 Duos

• 1 x RAID 0 pool

• 8 ION volumes, 2 LUNs per volume


• 8 direct-connect FC8 target-initiator links, 2 LUNs per initiator-target link

• 1 dm-multipath device per volume

• 1 worker/device, queue depth=256/worker

Preconditioning was performed prior to the set of tests for each block size by using fio to write to the entire range of the device with a 1 MB block size.

SEQUENTIAL R/W THROUGHPUT AND IOPS


RANDOM MIX R/W IOPS

RANDOM MIX R/W THROUGHPUT


The results above indicate performance measured and reported by fio; for selected tests, the numbers were compared with the output of the iostat command and found to be comparable.

Performance results can vary dramatically depending on the number of ION Data Accelerator volumes used, the number of paths to each volume, and the number of test threads run per volume (determined by the fio numjobs parameter). For this particular configuration, tests were run on a variety of volume, path, and thread counts before determining that 8 volumes, 2 paths per volume, and 1 thread per volume was optimal. This configuration was chosen because it provided the best results for random read IOPS. Depending on the specifics of a configuration and the workload chosen for optimization, other combinations may provide better results.

The tests above report peak random read performance of around 700,000 IOPS. However, to test initiator capabilities, some benchmarks were performed immediately after formatting the ioDrives. For example, this test achieved 800,000 IOPS:

# /usr/sbin/iontuner.py --setupfio='rw=randrw,bs=4k,rwmixread=100,runtime=600,loops=10000,numjobs=1'

Running immediately after a format is not a meaningful test for the ION Data Accelerator system itself, as reads are not serviced out of flash. Still, this indicated that given more ioDrives in the ION Data Accelerator, it is likely the DL980 could have achieved even higher performance numbers.

Similarly, the fastest reported combined read and write bandwidth is 6900 MB/s. Shortly after the cards were formatted, greater throughput was possible from the initiator:

# /usr/sbin/iontuner.py --setupfio='rw=randrw,bs=1m,rwmixread=50,runtime=600,loops=10000,numjobs=1'

This test achieved 3740 MB/s read bandwidth and 3750 MB/s write bandwidth, for a total bandwidth of 7490 MB/s.

A final indication that performance was limited by the ioDrives is the reduced mixed-workload bandwidth at some block sizes, which is comparable to test results seen with a single ioDrive in a local server.

Writing data to the full address range prior to testing is a necessary step to achieve realistic results with an ION Data Accelerator test. These final tests indicate that the NUMA architecture of the DL980 was unlikely to be the limiting factor in these fio results; the DL980 appeared to fully exercise the performance capabilities of the ION Data Accelerator.


Oracle Performance Testing

________________________________________________________________________

Oracle Orion is a tool for predicting the performance of an Oracle database without having to install Oracle or create a database. It simulates Oracle database I/O workloads using the same I/O software stack as Oracle.

Tuning for Orion is very similar to tuning for fio. By running simultaneous copies of Orion's advanced test, it is possible to approximate workloads similar to those used with fio. Alternatively, the Online Transaction Processing (OLTP) and Decision Support System (DSS) tests can be used to synthetically approximate user workloads. Orion can also be used to test mixed large and small block sizes.

TEST SETUP

The Orion tests were run as root, but it was necessary to set the ORACLE_HOME environment variable. To find this variable, run the following commands from an Oracle user shell:

# su - oracle
$ echo $ORACLE_HOME
/opt/oracle/product/11.2.0/db_1
$ exit

To set the variable for the root session, run the following command in the terminal, or add it to ~/.bashrc to make it permanent (the specific Oracle version will vary):

# export ORACLE_HOME=/opt/oracle/product/11.2.0/db_1

The iontuner RPM provides a script that can be used to generate Orion test commands with optimal NUMA tuning parameters. The RPM is available on the Fusion-io internal network in the same location as this document:

https://confluence.int.fusionio.com/display/ION/Documentation#Documentation-IONPerformanceBrief,HPDL980(INTERNAL-ONLY)


Orion .lun files can be created using the following command:

# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='<parameters>'

The script generates commands that have been shown to provide optimal performance results and efficient pinning for all test threads.

For example, the following command can be used to generate a 4KB read IOPS test:

# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600'

The script generates .lun files saved in the current directory and outputs the following commands:

taskset -c 20-29,100-109 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-6 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 20-29,100-109 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-7 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 20-29,100-109 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-4 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 20-29,100-109 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-5 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 0-9,80-89 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-2 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 0-9,80-89 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-3 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 0-9,80-89 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-0 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &
taskset -c 0-9,80-89 /opt/oracle/product/11.2.0/bin/orion -testname iontuner-dm-1 -run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 4 -duration 600 &

For this configuration, the best results were obtained by creating a separate .lun file for each volume and running a single Orion test on each volume. Splitting the volumes into separate .lun files made it possible for taskset to run each Orion test with affinity to the CPUs local to the devices being tested. The local CPUs can be determined with the multipath command, using the same method described in the fio Test Commands section earlier in this document.

You can copy and paste the taskset commands into the terminal to run them in parallel. Because the output from Orion displays only the maximum performance of each instance (which may individually occur at different times), the iostat command should be used to read performance as viewed from the initiator devices:


# iostat -x /dev/dm-*

TEST COMMANDS

The fio tests used for 8KB IOPS were approximated with the following commands:

# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type seq -num_large 0 -num_small 2048 -write 100 -size_small 8 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 8 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 100 -size_small 8 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 75 -size_small 8 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 50 -size_small 8 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 25 -size_small 8 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 0 -num_small 2048 -write 0 -size_small 8 -duration 600'

The fio tests used for 512KB bandwidth were approximated with the following commands:

# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type seq -num_large 2048 -num_small 0 -write 100 -size_large 512 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 2048 -num_small 0 -write 0 -size_large 512 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 2048 -num_small 0 -write 100 -size_large 512 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 2048 -num_small 0 -write 75 -size_large 512 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 2048 -num_small 0 -write 50 -size_large 512 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 2048 -num_small 0 -write 25 -size_large 512 -duration 600'
# /usr/sbin/iontuner.py --oraclehome=$ORACLE_HOME --setuporion='-run advanced -matrix point -type rand -num_large 2048 -num_small 0 -write 0 -size_large 512 -duration 600'

For running the DSS test, the iontuner.lun file was created with all eight volumes specified. The DSS test was run with the following command:

# taskset -c 0-9,80-89,20-29,100-109 ./orion -testname iontuner -run dss

Because all devices were used in a single command, the CPUs local to all of the HBAs were specified to taskset.

The OLTP test was run with the following command:

# taskset -c 0-9,80-89,20-29,100-109 ./orion -testname iontuner -run oltp

RESULTS

When running Orion advanced tests that approximated fio tests for 8KB and 512KB block sizes, the results were almost identical to fio.

There was more variation between runs than between the two utilities. Because the previous state of the ioDrives has a large impact on the performance of any test, it is necessary when comparing test runs to sequence tests in a consistent order and begin with the same initial ioDrive conditioning. Providing Orion results for these tests would only bring attention to minor variations that provide no additional information about the tuning of the DL980.

The advanced tests also exposed an unexpected Orion behavior: for block sizes larger than 512KB, it appears that 512KB accesses are always generated to the devices.

The DSS test resulted in a maximum bandwidth of 6039 MB/s.

There are many variations to the Orion test that could be experimented with. To get an accurate measurement of maximum performance, it is necessary to run multiple copies of the test and evaluate the results from iostat. With any of the test options that run multiple test points (advanced, OLTP, DSS) there is no guarantee that all of the test copies will synchronously run each test point. This may invalidate results.


Oracle Database Testing

________________________________________________________________________

For Oracle database testing, a number of tools were used to show the maximum capabilities of the system under a variety of workloads.

READ WORKLOAD TEST – QUEST BENCHMARK FACTORY

For a more realistic Oracle test, a Windows server was connected to the DL980 via an additional Fibre Channel link. An Oracle disk group was created containing all of the ION Data Accelerator volumes.

Quest Benchmark Factory was used to create a database on the disk group with the following configuration:

• Size: 300GB

• Logging Mode: ARCHIVELOG

The Oracle components below were placed in one ASM disk group, +DATA, which consisted of 8 LUNs (each 800 GB) enabled with multipathing:

• Redo – 20 redo log members, each 2048 MB in size

• Archivelogs – placed in the default FRA

• FRA – db_recovery_file_dest='+DATA', db_recovery_file_dest_size='3000G'

• UNDO, data, and temporary tablespaces

The ASM +DATA disk group was created with external redundancy and with the default 1MB AU size. The SYS, SYSTEM, and a second UNDO tablespace were created in the ADMIN disk group. This was done in order to easily drop and recreate the TEST data and disk groups without having to recreate the database.


For a read workload test, Quest Benchmark Factory > Database Scalability Job > TPC-H Power Test was used.


The test was configured for 50 users.


Performance was evaluated on the DL980 while TPC-H Power Test was running.

Oracle Enterprise Manager was used to show read bandwidth during the test.

During the test Oracle showed a read bandwidth of just over 6000 MB/s.

An Automated Workload Repository (AWR) report was generated during the test. The following excerpts provide details on the I/O performed by the test.


The AWR report function summary shows a total read bandwidth of 5.8 GB/s averaged over the length of the test.

The file statistics show the breakdown of I/O for each file.


The bandwidth from the ION volumes was verified with a snapshot from 'iostat -mx /dev/dm-*'.

An approximate read bandwidth of 755MB/s was seen on each of the eight volumes, for a total read bandwidth of 6043MB/s from the ION Data Accelerator server. The avgrq-sz column shows that the average request size was between 512 and 1024 sectors (256 KB and 512 KB). These results are consistent with the bandwidth of approximately 6100MB/s seen from fio in this block size range. However, it is important to recognize that Oracle performs data transfers of many sizes simultaneously, so the synthetic fixed block size results of fio are not a direct comparison, only an approximation of the capability at this workload.

OLTP WORKLOAD TEST – HEAVY INSERT SCRIPT

Performance was evaluated while running a custom OLTP load generated by a script running heavy insert database transactions on the DL980.

Oracle Enterprise Manager was used to show bandwidth and IOPS during the test.

During the test Oracle showed a total bandwidth of approximately 4000 MB/s.

An AWR report was generated during the test. The following excerpts provide details on the I/O performed by the test.

The AWR report function summary shows a total read bandwidth of 884 MB/s and write bandwidth of 2.6 GB/s averaged over the length of the test, or 3.5 GB/s combined.

The file statistics show the breakdown of I/O for each file.

Using 'iostat -mx /dev/dm-*', a snapshot of the bandwidth on the ION volumes was captured to verify the totals.

A read bandwidth of 952 MB/s and a write bandwidth of 2505 MB/s were seen, for a total bandwidth of 3457 MB/s from the ION Data Accelerator server. The workload is 22% read and 78% write I/O. The avgrq-sz column shows that the average request size was around 123 sectors, or 61 KB. The result from the fio test for a 25% read workload and 64KB block size was 3705 MB/s, which is consistent with the results of this test. Once again, it is important to recognize that Oracle performs data transfers of many sizes simultaneously, so the synthetic fixed-block-size results of fio are not a direct comparison, only an approximation of the capability at this workload.

TRANSACTIONS TEST – SWINGBENCH

An Order Entry Sample OLTP Test was run in Swingbench on the DL980. The test was configured with 100 users and transaction delay disabled. Because of some difficulties with Swingbench that were not related to performance, hyper-threading was disabled for this test.

The test resulted in an average of 934,359 transactions per minute (TPM) and a maximum of 1,150,103 TPM.

Oracle transactions vary greatly in the I/O they produce on the backend storage. A specific TPM number such as the one provided by Swingbench is only useful when compared to a number produced by a Swingbench test with the same parameters.

Conclusions ________________________________________________________________________

Prior to tuning, it is possible that performance on a NUMA system such as the HP DL980 will appear to be lower than that of systems with less complex architectures. The script used throughout this document for NUMA-specific tuning will be made available to simplify and standardize this tuning process.

Synthetic benchmarks such as fio or Orion provide direct measurement of ION Data Accelerator storage capabilities. The flexibility of these tools is extremely useful when tuning storage configurations and initiator system parameters. The comparable results achieved by fio and Orion indicate that either of these tools is sufficient. The configuration used at Fusion-io in San Jose was capable of sustaining 700,000 random IOPS and up to 7GB/s in bandwidth, but there were indicators that the DL980 would have been capable of sustaining even greater numbers when used in combination with more ioDrives in the ION Data Accelerator.

However, synthetic benchmark performance alone does not guarantee user application performance. Additional system parameters must be tuned for Oracle, and appropriate tests must be used to identify the maximum performance for each specific workload. Oracle produced a read bandwidth of up to 6GB/s and a mixed bandwidth of nearly 3.5GB/s. While these numbers may seem to be lower than those seen by fio, they are very comparable to the results of an fio test with a similar read/write balance and average block size. The close proximity of the Oracle results to the fio results indicates that Oracle has been tuned to take full advantage of the performance of the storage. Tests in Swingbench were measured at up to 1,150,103 TPM, but this number is only useful when compared to other Swingbench results.

NUMA support is an active topic in Linux development. As newer distributions become available and their built-in tools improve, it is likely that less manual tuning will be necessary. While the tuning applied by the provided script is not currently persistent, methods are being investigated to provide automatic tuning at boot time as well as upon device discovery. When configured properly, the DL980 is a very powerful Oracle initiator for use with the ION Data Accelerator.

Glossary ________________________________________________________________________

Initiator - An initiator of I/O is analogous to a client in a client/server system. Initiators use a SCSI transport protocol to access block storage over a network. A database or mail server is an initiator, for example.

LUN – Logical Unit Number. Targets furnish containers for I/O that are a contiguous array of blocks identified by a logical unit number. A LUN is usually synonymous with a physical disk drive, since initiators perceive it as such. For ION Data Accelerator, a LUN is a volume that has been exported to one or more initiators.

Pool – an aggregation of ioMemory or RAIDset block devices. Block devices can be added to a pool.

Target – the opposite of an initiator: a receiver of I/O operations, analogous to a server in a client/server system. The target is the provider of (network) storage; a SAN disk array is a traditional target. ION Data Accelerator, by comparison, is an all-flash storage target.

Volume – a logical construct identifying a unit of data storage. A volume is allocated to allow for expandability within the space constraints of a pool. For ION Data Accelerator, a volume is not necessarily directly linked to a physical device.

Appendix A: Tuning Checklist ________________________________________________________________________

The following is a complete checklist of the tuning steps described in this document, for use as a quick reference:

1. Check initiator HBA slot locations.

2. Check ION storage profile.

3. Verify that a sufficient number of ION volumes are used.

4. Verify that a sufficient number of LUN paths are used.

5. Verify that LUN paths are distributed so all fabric resources are balanced.

6. Verify that all LUNs for each volume are presented only to HBAs within one NUMA node.

7. Update the BIOS and verify that NUMA distances are detected properly.

8. Set the BIOS power profile to Maximum Performance.

9. Verify that cstates are disabled in the BIOS.

10. Enable Hyperthreading in the BIOS settings.

11. Disable virtualization and VT-d in the BIOS if not needed.

12. Check the addressing mode in the BIOS.

13. Disable x2APIC in the BIOS.

14. Verify multipath path_selector is queue-length

15. Disable processor cstates with boot parameters.

16. Install the iontuner RPM (tunes block devices with udev rules, disables the cpuspeed daemon, disables the irqbalance daemon, and pins IRQs).

17. Use fio or Orion commands generated by iontuner when testing baseline performance.

18. Configure HugePages for Oracle.

19. Configure sysctl parameters for Oracle.

20. Configure Oracle initialization parameters, including _enable_NUMA_support and use_large_pages.
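As an illustrative sketch of steps 18 through 20, the HugePages and Oracle settings might look like the following. All values are examples only (the page count assumes a 64 GB SGA with 2 MB HugePages), not the settings used in these tests:

```shell
# 18. HugePages for Oracle: 64 GB SGA / 2 MB pages = 32768 pages,
#     plus a small margin (example value only).
echo "vm.nr_hugepages = 32800" >> /etc/sysctl.conf

# 19. A typical Oracle sysctl parameter (illustrative value).
echo "fs.aio-max-nr = 1048576" >> /etc/sysctl.conf
sysctl -p

# 20. Oracle initialization parameters, set from SQL*Plus:
#     ALTER SYSTEM SET "_enable_NUMA_support"=TRUE SCOPE=SPFILE;
#     ALTER SYSTEM SET use_large_pages=ONLY SCOPE=SPFILE;
```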

Appendix B: Speeding up Oracle Database Performance with ioMemory – an HP Session

________________________________________________________________________

This appendix is adapted from a session presented at the HP ExpertOne Technology & Solutions Summit, Dec. 2012 in Frankfurt, Germany.

ARCHITECTURE OVERVIEW

The diagram below shows the basic topology for shared NAND flash storage using the ION Data Accelerator connected to database servers.

I/O bottlenecks in a shared storage system can be removed by strategically placing transaction logs, the TempDB, hot (frequently accessed) tables, or the entire database on ioMemory in the ION Data Accelerator.

ABOUT ION DATA ACCELERATOR

An ION Data Accelerator system consists of the following basic components:

ION Data Accelerator Software – runs as a GUI or CLI, transforming tier 1 servers into an open shared flash resource. Up to 20x performance improvement has been achieved, compared to traditional disk-based shared storage systems.

Fusion ioMemory – is proven, tested, reliable, and fast, with thousands of satisfied customers worldwide.

Open System Platforms – ION Data Accelerator software runs on a variety of tier 1 servers, providing industry-leading performance, reliability, and capacity. Hundreds of thousands of these servers are deployed in enterprises today. Supported network protocols include Fibre Channel, SRP/InfiniBand, and iSCSI.

ION Data Accelerator Software

The ION Data Accelerator software running on the host server

• Is optimized for ioMemory

• Works on industry-standard servers

• Supports JBOD, RAID 0, and RAID 10 modes (including spare drives)

• Provides GUI, CLI, SMIS, and SNMP access

• Is easy to configure

• Enables software-defined storage

Fusion-Powered Storage Stack

The following diagram shows how the elements of a Fusion-powered software/hardware stack fit together, from the application down to the server:

• Application – your application

• ION Software – transforms the server into a storage target

• VSL – Virtual Storage Layer, a purpose-built flash access layer

• ioMemory – fast, reliable, cost-effective flash memory in a PCIe form factor

• Tier 1 server – the host server platform

Why ION Data Accelerator?

ION Data Accelerator provides the following advantages:

• It is a highly efficient shared storage target.

• With its low latency, high IOPS, and high bandwidth, it can accelerate writes and reads in a variety of environments, including SAP, SQL, Navision, Oracle, VMware, etc.

• It outperforms even cache hits from storage array vendors.

Because of the increased performance that ION Data Accelerator achieves, customers can

• Support more concurrent users

• Lower response times

• Run queries and reports faster

• Finish batch jobs in less time

• Increase application stability

ABOUT ION DATA ACCELERATOR HA (HIGH AVAILABILITY)

ION Data Accelerator enables a powerful and effective HA (High Availability) environment for your shared storage, when HA licensing is enabled.

The diagram below shows basic LUN access (exported volumes) in an HA configuration.

(Diagram: paired HA nodes each presenting LUN 0 and LUN 1 to the initiators over 40Gb links.)

PERFORMANCE TEST RESULTS: HP DL380 / HP DL980

The following charts show performance results for an HP DL380 target running ION Data Accelerator, with an HP DL980 initiator.

OVERVIEW OF THE ION DATA ACCELERATOR GUI

Summary Screen:

Creating a Storage Profile for the storage pool:

Creating volumes from the storage pool:

Setting up an initiator group (LUN masking) to access volumes:

Managing initiators:

Editing initiator access:

Managing volumes:

COMPARATIVE SOLUTIONS

The diagram below shows a winning solution for ION Data Accelerator and Oracle, compared with rival EMC:

• Database server: HP DL980, Red Hat 6, 64 or 80 Intel E7 cores, 1 TB memory, 700 GB Oracle SGA

• HP IO Accelerator: redo logs, hot tables, TempDB

• 3PAR T400: other apps and tablespaces

The table below illustrates the competitive advantages of ION Data Accelerator:

Comparison Point – ION – Note

• Open Systems Server Foundation – ✔ – Fusion-io relies on time-tested open systems server hardware, while competitors are proprietary

• Fusion-io Adaptive Flashback vs. Competitor RAID – ✔ – VSL with Adaptive Flashback provides two orders of magnitude better media error rates

• ION RAID vs. Competition – ✔ – ION provides more flexibility with JBOD, RAID-0, and RAID-10 vs. one static configuration option

• Street Price ($/GB) – ✔ – Fusion-io delivers a solution estimated to be at least 30% lower cost/GB

• Price/IOPS – ✔ – Fusion-io is the clear winner

• Power – ✔ – Fusion-io draws less power

BEST PRACTICES

The following best practices are important to follow in order to achieve top performance for Oracle testing.

• Present 16 to 32 LUNs to the host for maximum performance.

• Use the noop scheduler.

• Use round robin for multipath.conf.

• When using a DL980 as load generator, make sure you pin the I/O issuing processes.

• It doesn’t matter so much on which nodes the processes are pinned, as long as they are pinned.
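A hedged sketch of the last three practices above; the device names, node number, and fio parameters are illustrative, not the exact commands used in these tests:

```shell
# Use the noop scheduler on each ION multipath device (device name hypothetical).
echo noop > /sys/block/dm-0/queue/scheduler

# multipath.conf fragment selecting round-robin path selection:
#   device {
#       path_selector "round-robin 0"
#   }

# Pin the I/O-issuing load generator to one NUMA node with numactl.
# Any node works, as long as the processes stay pinned.
numactl --cpunodebind=0 --membind=0 \
    fio --name=randread --filename=/dev/mapper/ion_lun0 \
        --rw=randread --bs=4k --iodepth=32 --numjobs=8 --direct=1
```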

The maximum performance configuration shown below achieved about 700K IOPS.

(Diagram: DL980 with four CPUs across IOH 1 and IOH 2, four HBAs connected through a switch to the two HBAs of the ION target.)

BENCHMARK TEST CONFIGURATION

Below is a proof-of-concept configuration that can be extended in any direction:

A single server can achieve 600K IOPS at a 4KB block size.

Below are system configurations for the storage server (ION Data Accelerator appliance) and the database server.

Storage Server

• DL380p Gen8, 2 socket, 2.9GHz

• 4 x 2.4TB HP IO Accelerator

• 2 x dual-port 8Gbit Fibre Channel

Database Server

• DL980 G7 8s /80c, 1TB RAM

• 4 x dual-port 8Gbit Fibre Channel

RAW PERFORMANCE TEST RESULTS WITH FIO

(Chart: total IOPS vs. number of jobs and queue depth, scaling from 1 to 128 jobs and peaking near 700,000 IOPS.)

ION Data Accelerator with RAID 0, 2 RAIDSETS, 32 LUNs at 4KB block size, 100% read

(Chart: average completion latency in microseconds and IOPS vs. number of jobs, from 1 to 128; the latency axis spans roughly 0 to 1,000 µs.)

ION Data Accelerator with RAID 0, 2 RAIDSETS, 32 LUNs at 4KB block size, 100% read, Qdepth = 4

Raw I/O Test: 70% Read, 30% Write

ION Data Accelerator with RAID 0, 2 RAIDSETS, 16 LUNs at 4KB block size

Raw I/O Test: 100% Read at 8KB

(Chart: IOPS vs. number of jobs and queue depth, in bands from 0 to 500,000 IOPS.)

ION Data Accelerator with RAID 0, 2 RAIDSETS, 32 LUNs at 8KB block size

Raw I/O Test: Read Latency (Microseconds)

(Chart: read latency vs. number of jobs and queue depth, in bands from 0 to 18,000 µs.)

ION Data Accelerator with RAID 0, 2 RAIDSETS, 32 LUNs at 8KB block size

ORACLE WORKLOAD TESTS

The following configuration was used for Oracle workload testing:

Database

• 1TB of data

• Tables from one million to one billion rows

Data Access Pattern

• Sequential write

• Data load (bulk load, real-time)

• Full table scan

• Select data via index

• Update data via index

(Charts: MB/sec and IOPS vs. number of processes; approximately 2.2 GB/sec random read.)

(Chart: IOPS vs. number of processes.)

• Up to 2.5 GB/sec write

• Up to 300 MB/sec redo log

• CPU load: 21% max

Load generator: hammerora, from http://hammerora.sourceforge.net

• 1 TB database size, 80 users, 10 ms delay

CPU load: 33%; almost no I/O wait!