
Kunpeng BoostKit for SDS

Tuning Guide

Issue 10

Date 2021-09-13

HUAWEI TECHNOLOGIES CO., LTD.


Copyright © Huawei Technologies Co., Ltd. 2021. All rights reserved.

No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice

The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.

The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.


Contents

1 Using the Kunpeng Hyper Tuner for Tuning

2 Ceph Block Storage Tuning Guide
2.1 Introduction
2.1.1 Components
2.1.2 Environment
2.1.3 Tuning Guidelines and Process Flow
2.2 General-Purpose Storage
2.2.1 Hardware Tuning
2.2.2 System Tuning
2.2.3 Ceph Tuning
2.2.4 KAE zlib Compression Tuning
2.3 High-Performance Storage
2.3.1 Hardware Tuning
2.3.2 System Tuning
2.3.3 Ceph Tuning

3 Ceph Object Storage Tuning Guide
3.1 Introduction
3.1.1 Overview
3.1.2 Environment
3.1.3 Tuning Guidelines and Process Flow
3.2 Cold Storage
3.2.1 Hardware Tuning
3.2.2 System Tuning
3.2.3 Ceph Tuning
3.3 General-Purpose Storage
3.3.1 Hardware Tuning
3.3.2 System Tuning
3.3.3 Ceph Tuning
3.3.4 KAE zlib Compression Tuning
3.4 High-Performance Storage
3.4.1 Hardware Tuning
3.4.2 Ceph Tuning


3.4.3 KAE MD5 Digest Algorithm Tuning

4 Ceph File Storage Tuning Guide
4.1 Introduction
4.1.1 Components
4.1.2 Environment
4.1.3 Tuning Guidelines and Process Flow
4.2 General-Purpose Storage
4.2.1 Hardware Tuning
4.2.2 System Tuning
4.2.3 Ceph Tuning
4.2.4 KAE zlib Compression Tuning

A Change History


1 Using the Kunpeng Hyper Tuner for Tuning

To tune the performance of components in the Kunpeng BoostKit for SDS, you can use the Kunpeng Hyper Tuner. When creating an analysis project, select Distributed Storage. For details, see Kunpeng Hyper Tuner.


2 Ceph Block Storage Tuning Guide

2.1 Introduction

2.2 General-Purpose Storage

2.3 High-Performance Storage

2.1 Introduction

2.1.1 Components

Ceph

Ceph is a distributed, scalable, reliable, and high-performance storage system platform that supports storage interfaces including block devices, file systems, and object gateways. The optimization methods described in this document include hardware optimization and software configuration optimization. Software code optimization is not involved. By adjusting the system and Ceph configuration parameters, Ceph can fully utilize the hardware performance of the system. Ceph Placement Group (PG) distribution optimization and object storage daemon (OSD) core binding aim to balance drive loads and prevent any OSD from becoming a bottleneck. In addition, in general-purpose storage scenarios, using NVMe SSDs as Bcache can also improve performance. Figure 2-1 shows the Ceph architecture.


Figure 2-1 Ceph architecture

Table 2-1 describes the Ceph modules and components.

Table 2-1 Module functions

Module Function

RADOS Reliable Autonomic Distributed Object Store (RADOS) is the heart of a Ceph storage cluster. Everything in Ceph is stored by RADOS in the form of objects irrespective of their data types. The RADOS layer ensures data consistency and reliability through data replication, fault detection and recovery, and data recovery across cluster nodes.

OSD Object storage daemons (OSDs) store the actual user data. Every OSD is usually bound to one physical drive. The OSDs handle the read/write requests from clients.

MON The monitor (MON) is the most important component in a Ceph cluster. It manages the Ceph cluster and maintains the status of the entire cluster. The MON ensures that related components of a cluster can be synchronized at the same time. It functions as the leader of the cluster and is responsible for collecting, updating, and publishing cluster information. To avoid single points of failure (SPOFs), multiple MONs are deployed in a Ceph environment, and they must handle the collaboration between them.


MGR The manager (MGR) is a monitoring system that provides collection, storage, analysis (including alarming), and visualization functions. It makes certain cluster parameters available for external systems.

Librados Librados is a method that simplifies access to RADOS. Currently, it supports the programming languages PHP, Ruby, Java, Python, C, and C++. It provides a local interface to RADOS, the Ceph storage cluster, and is the base component of other services such as the RADOS block device (RBD) and RADOS gateway (RGW). In addition, it provides the Portable Operating System Interface (POSIX) for the Ceph file system (CephFS). The Librados API can be used to directly access RADOS, enabling developers to create their own interfaces for accessing the Ceph cluster storage.

RBD The RADOS block device (RBD) is the Ceph block device that provides block storage for external systems. It can be mapped, formatted, and mounted like a drive to a server.

RGW The RADOS gateway (RGW) is a Ceph object gateway that provides RESTful APIs compatible with S3 and Swift. The RGW also supports multi-tenancy and the OpenStack Identity service (Keystone).

MDS The Ceph Metadata Server (MDS) tracks the file hierarchy and stores metadata used only for CephFS. The RBD and RGW do not require metadata. The MDS does not directly provide data services for clients.

CephFS The CephFS provides a POSIX-compatible distributed file system of any size. It depends on the Ceph MDS to track the file hierarchy, namely the metadata.

Vdbench

Vdbench is a command line utility designed to help engineers and customers generate drive I/O loads for verifying storage performance and data integrity. You can also specify Vdbench execution parameters in text files.

Vdbench has many parameters. Table 2-2 lists some important common parameters.

Table 2-2 Common parameters

Parameter Description

-f Specifies a script file for the pressure test.

-o Specifies the path for exporting a report. The default value is the current path.


lun Specifies the LUN device or file to be tested.

size Specifies the size of the LUN device or file to be tested.

rdpct Specifies the read percentage. The value 100 indicates full read, and the value 0 indicates full write.

seekpct Specifies the percentage of random data. The value 100 indicates all random data, and the value 0 indicates sequential data.

elapsed Specifies the duration of the current test.
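These parameters are usually combined in a parameter file that is passed to Vdbench with -f. The following is a minimal sketch of such a file for a 4 KB random-write test; the device path /dev/rbd0, the run length, and the output directory are illustrative assumptions, not values taken from this guide.

# Create a hypothetical parameter file for a 4 KB random-write test
cat > rbd_randwrite.txt << 'EOF'
sd=sd1,lun=/dev/rbd0,openflags=o_direct
wd=wd1,sd=sd1,xfersize=4k,rdpct=0,seekpct=100
rd=run1,wd=wd1,iorate=max,elapsed=300,interval=1
EOF
# Run the test and export the report to ./result
./vdbench -f rbd_randwrite.txt -o ./result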

2.1.2 Environment

Physical Networking

The physical environment of the Ceph block devices contains two network layers and three nodes. In the physical environment, the MON, MGR, MDS, and OSD nodes are deployed together. At the network layer, the public network is separated from the cluster network. The two networks use 25GE optical ports for communication.

Figure 2-2 shows the physical network.

Figure 2-2 Physical networking


Hardware Configuration

Table 2-3 shows the Ceph hardware configuration.

Table 2-3 Hardware configuration

Server TaiShan 200 server (model 2280)

Processor Kunpeng 920 5230 processor

Core 2 x 32-core

CPU frequency 2600 MHz

Memory capacity 12 x 16 GB

Memory frequency 2666 MHz (8 Micron 2R memory modules)

NIC IN200 NIC (4 x 25GE)

Drive System drives: RAID 1 (2 x 960 GB SATA SSDs); data drives of general-purpose storage: JBOD enabled in RAID mode (12 x 4 TB SATA HDDs)

NVMe SSD Acceleration drive of general-purpose storage: 1 x 3.2 TB ES3600P V5 NVMe SSD; data drives of high-performance storage: 12 x 3.2 TB ES3600P V5 NVMe SSDs

RAID controller card Avago SAS 3508

Software Versions

Table 2-4 lists the required software versions.

Table 2-4 Software versions

Software Version

OS CentOS Linux release 7.6.1810

openEuler 20.03 LTS SP1

Ceph 14.2.x Nautilus

ceph-deploy 2.0.1

Vdbench 5.04.06

Node Information

Table 2-5 describes the IP network segment planning of the hosts.


Table 2-5 Node information

Host Type   Host Name   Public Network Segment   Cluster Network Segment

OSD/MON node   Node 1   192.168.3.0/24   192.168.4.0/24

OSD/MGR node   Node 2   192.168.3.0/24   192.168.4.0/24

OSD/MDS node   Node 3   192.168.3.0/24   192.168.4.0/24

Component Deployment

Table 2-6 describes the deployment of service components in the Ceph block device cluster.

Table 2-6 Component deployment

Physical Machine Name   OSD   MON   MGR

Node 1 12 OSDs 1 MON 1 MGR

Node 2 12 OSDs 1 MON 1 MGR

Node 3 12 OSDs 1 MON 1 MGR

Cluster Check

Run the ceph health command to check the cluster health status. If HEALTH_OK is displayed, the cluster is running properly.
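For example, the following commands can be run on any MON node; ceph -s prints a fuller summary (the exact output depends on the cluster state).

ceph health     # expected output: HEALTH_OK
ceph -s         # shows MON/MGR/OSD counts, PG states, and client I/O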

2.1.3 Tuning Guidelines and Process Flow

The block storage tuning varies with the hardware configuration.

● General-purpose storage: HDDs are used as data drives, and solid state disks (SSDs) are configured as DB/WAL partitions and metadata storage pools.

● High-performance storage: All data drives are SSDs.

Perform the tuning based on your hardware configuration.

Tuning Guidelines

Performance optimization must comply with the following principles:

● When analyzing the performance, analyze the system resource bottlenecks from multiple aspects. For example, insufficient memory capacity may cause the CPU to be occupied by memory scheduling tasks and the CPU usage to reach 100%.


● Adjust only one performance parameter at a time.

● The analysis tool may consume system resources and aggravate certain system resource bottlenecks. Therefore, the impact on applications must be avoided or minimized.

Tuning Process Flow

The tuning analysis flow is as follows:

1. In many cases, pressure test traffic is not completely sent to the backend (server). For example, a protection policy may be triggered on network access layer services such as Server Load Balancing (SLB), Web Application Firewall (WAF), High Defense IP, and even Content Delivery Network (CDN)/site acceleration in a cloud-based architecture. This occurs because the specifications, such as bandwidth, maximum number of connections, and number of new connections, are limited, or the pressure test shows the features of Challenge Collapsar (CC) and Distributed Denial of Service (DDoS) attacks. As a result, the pressure test results do not meet expectations.

2. Check whether the key indicators meet the requirements. If not, locate the fault. The fault may be caused by the servers (in most cases) or the clients (in a few cases).

3. If the problem is caused by the servers, focus on the hardware indicators such as the CPU, memory, drive I/O, and network I/O. Locate the fault and perform further analysis on the abnormal hardware indicator.

4. If all hardware indicators are normal, check the middleware indicators such as the thread pool, connection pool, and GC indicators. Perform further analysis based on the abnormal middleware indicator.

5. If all middleware indicators are normal, check the database indicators such as the slow query SQL indicators, hit ratio, locks, and parameter settings.

6. If the preceding indicators are normal, the algorithm, buffer, cache, synchronization, or asynchronization of the applications may be faulty. Perform further analysis.

Table 2-7 lists the possible bottlenecks.

Table 2-7 Possible bottlenecks

Bottleneck Description

Hardware/Specifications: Problems of the CPU, memory, and drive I/O. The problems are classified into server hardware bottlenecks and network bottlenecks (network bottlenecks can be ignored in a LAN).

Middleware: Problems of software such as application servers, web servers, and database systems. For example, a bottleneck may be caused if parameters of the Java Database Connectivity (JDBC) connection pool configured on the WebLogic platform are set improperly.


Applications: Problems related to applications developed by developers. For example, when the system receives a large number of user requests, the following problems may cause low system performance: slow SQL statements and improper Java Virtual Machine (JVM) parameters, container settings, database design, program architecture planning, and program design (insufficient threads for serial processing and request processing, no buffer, no cache, and mismatch between producers and consumers).

OS: Problems related to the OS such as Windows, UNIX, or Linux. For example, if the physical memory capacity is insufficient and the virtual memory capacity is improper during a performance test, the virtual memory swap efficiency may be greatly reduced. As a result, the response time is increased. This bottleneck is caused by the OS.

Network devices: Problems related to devices such as firewalls, dynamic load balancers, and switches. Currently, more network access products are used in the cloud service architecture, including but not limited to the SLB, WAF, High Defense IP, CDN, and site acceleration. For example, if a dynamic load distribution mechanism is set on the dynamic load balancer, the dynamic load balancer automatically sends subsequent transaction requests to low-load servers when the hardware resource usage of a server reaches the limit. If the dynamic load balancer does not function as expected in the test, the problem is a network bottleneck.

General tuning procedure:

Figure 2-3 shows the general tuning procedure.

Figure 2-3 General tuning procedure


2.2 General-Purpose Storage

2.2.1 Hardware Tuning

NVMe SSD Tuning

● Purpose: Reduce cross-chip data overheads.

● Procedure: Install the NVMe SSDs and the NIC into the same riser card.

DIMM Installation Mode Tuning

● Purpose: Populate one DIMM per channel (1DPC) to maximize the memory performance. That is, populate the DIMM 0 slot of each channel.

● Procedure: Preferentially populate DIMM 0 slots (DIMM 000, 010, 020, 030, 040, 050, 100, 110, 120, 130, 140, and 150). Of the three digits in the DIMM slot number, the first digit indicates the CPU, the second digit indicates the DIMM channel, and the third digit indicates the DIMM. Populate the DIMM slots whose third digit is 0 in ascending order.

2.2.2 System Tuning

Optimizing the OS Configuration

● Purpose: Adjust the system configuration to maximize the hardware performance.

● Procedure: Table 2-8 lists the optimization items.


Table 2-8 OS configuration parameters

vm.swappiness
Description: The swap partition is the virtual memory of the system. Do not use the swap partition because it will deteriorate system performance.
Default value: 60. Symptom: The performance deteriorates significantly when the swap partition is used.
Suggestion: Disable the swap partition and set this parameter to 0.
Configuration method: sudo sysctl vm.swappiness=0

MTU
Description: Maximum size of a data packet that can pass through a NIC. After the value is increased, the number of network packets can be reduced and the efficiency can be improved.
Default value: 1500 bytes. Symptom: Run the ip addr command to view the value.
Suggestion: Set the maximum size of a data packet that can pass through a NIC to 9000 bytes.
Configuration method:
1. Run vi /etc/sysconfig/network-scripts/ifcfg-${Interface} and add MTU="9000". (${Interface} indicates the network port name.)
2. After the configuration is complete, restart the network service: service network restart


pid_max
Description: The default value of pid_max is 32768, which is sufficient in normal cases. However, when heavy workloads are being processed, this value is insufficient and may cause memory allocation failure.
Default value: 32768. Symptom: Run the cat /proc/sys/kernel/pid_max command to view the value.
Suggestion: Set the maximum number of threads that can be generated in the system to 4194303.
Configuration method: echo 4194303 > /proc/sys/kernel/pid_max

file_max
Description: Maximum number of files that can be opened by all processes in the system. In addition, some programs can call the setrlimit interface to set the limit on each process. If the system generates a large number of errors indicating that file handles are used up, increase the value of this parameter.
Default value: 13291808. Symptom: Run the cat /proc/sys/fs/file-max command to view the value.
Suggestion: Set the maximum number of files that can be opened by all processes in the system to the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.
Configuration method: echo ${file-max} > /proc/sys/fs/file-max, where ${file-max} is the value displayed by the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command.


read_ahead
Description: Linux readahead means that the Linux kernel prefetches a certain area of the specified file and loads it into the page cache, so that subsequent access to that area does not block on page faults. Reading data from memory is much faster than reading data from drives. Therefore, the readahead feature can effectively reduce the number of drive seeks and the I/O waiting time of the applications. It is one of the important methods for optimizing the drive read I/O performance.
Default value: 128 KB. Symptom: Run /sbin/blockdev --getra /dev/sdb to view the value.
Suggestion: Change the value to 8192 KB. Improve the drive read efficiency by pre-reading and recording the data to random access memory (RAM).
Configuration method: /sbin/blockdev --setra 8192 /dev/sdb (/dev/sdb is used as an example; modify this parameter for all data drives.)


I/O_Scheduler
Description: The Linux I/O scheduler is a component of the Linux kernel. You can adjust the scheduler to optimize system performance.
Default value: CFQ. Symptom: The Linux I/O scheduler needs to be configured based on different storage devices for the optimal system performance.
Suggestion: Set the I/O scheduling policy to deadline for HDDs and noop for SSDs.
Configuration method: echo deadline > /sys/block/sdb/queue/scheduler (/dev/sdb is used as an example; modify this parameter for all data drives.)

nr_requests
Description: If the Linux system receives a large number of read requests, the default number of request queues may be insufficient. To deal with this problem, you can dynamically adjust the default number of request queues in the /sys/block/hda/queue/nr_requests file.
Default value: 128. Symptom: Increase the drive throughput by adjusting the nr_requests parameter.
Suggestion: Set the number of drive request queues to 512.
Configuration method: echo 512 > /sys/block/sdb/queue/nr_requests (/dev/sdb is used as an example; modify this parameter for all data drives.)
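The items in Table 2-8 can be applied together from a short script. The sketch below is only a convenience wrapper around the commands in the table; it assumes the data drives are sdb through sdm, which you should adjust to your environment. The echo-based settings do not persist across reboots, so reapply them after a restart or add them to a startup script.

# Apply the OS-level tuning items from Table 2-8 (run as root on every storage node)
sysctl vm.swappiness=0
echo 4194303 > /proc/sys/kernel/pid_max
echo "$(grep MemTotal /proc/meminfo | awk '{print $2}')" > /proc/sys/fs/file-max
for disk in sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm; do
    /sbin/blockdev --setra 8192 /dev/${disk}               # readahead
    echo deadline > /sys/block/${disk}/queue/scheduler     # deadline for HDDs (use noop for SSDs)
    echo 512 > /sys/block/${disk}/queue/nr_requests        # request queue depth
done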

Optimizing the Network Performance

● Purpose: This test uses the 25GE Ethernet adapter (Hi1822) with four ports (SFP+). It is used as an example to describe how to optimize the NIC parameters for the optimal performance.

● Procedure:


The optimization methods include adjusting NIC parameters and interrupt-core binding (binding interrupts to the physical CPU of the NIC). Table 2-9 describes the optimization items.

Table 2-9 NIC parameters

irqbalance
Description: System interrupt balancing service, which automatically allocates NIC software interrupts to idle CPUs.
Default value: active. Symptom: When this function is enabled, the system automatically allocates NIC software interrupts to idle CPUs.
Suggestion: Disable irqbalance (systemctl stop irqbalance) and keep the function disabled after the server is restarted (systemctl disable irqbalance).

rx_buff
Description: Aggregation of large network packets requires multiple discontinuous memory pages and causes low memory usage. You can increase the value of this parameter to improve the memory usage.
Default value: 2. Symptom: When the value is set to 2 by default, interrupts consume a large number of CPU resources.
Suggestion: Load the rx_buff parameter and set the value to 8 to reduce discontinuous memory and improve memory usage and performance. For details, see the description following the table.

ring_buffer
Description: You can increase the throughput by adjusting the NIC buffer size.
Default value: 1024. Symptom: Run the ethtool -g <NIC name> command to view the value.
Suggestion: Change the ring_buffer queue size to 4096. For details, see the description following the table.


lro
Description: lro indicates large receive offload. After this function is enabled, multiple small packets are aggregated into one large packet for better efficiency.
Default value: off. Symptom: After this function is enabled, the maximum throughput increases significantly.
Suggestion: Enable the large-receive-offload function to help networks improve the efficiency of sending and receiving packets. For details, see the description following the table.

hinicadm lro -i hinic0 -t <NUM>
Description: Received aggregated packets are sent after the time specified by <NUM> (in microseconds). You can set the value to 256 microseconds for better efficiency.
Default value: 16 microseconds. Symptom: This parameter is used with the LRO function.
Suggestion: Change the value to 256 microseconds.

hinicadm lro -i hinic0 -n <NUM>
Description: Received aggregated packets are sent after the number of aggregated packets reaches the value specified by <NUM>. You can set the value to 32 for better efficiency.
Default value: 4. Symptom: This parameter is used with the LRO function.
Suggestion: Change the value to 32.

– Adjusting rx_buff

i. Go to the /etc/modprobe.d directory.
cd /etc/modprobe.d

ii. Create the hinic.conf file.
vi /etc/modprobe.d/hinic.conf
Add the following information to the file:
options hinic rx_buff=8

iii. Reload the driver.
rmmod hinic
modprobe hinic

iv. Check whether the value of rx_buff is changed to 8.
cat /sys/bus/pci/drivers/hinic/module/parameters/rx_buff

– Adjusting ring_buffer

i. Change the buffer size from the default value 1024 to 4096.
ethtool -G <NIC name> rx 4096 tx 4096

ii. Check the current buffer size.


ethtool -g <NIC name>

– Enabling LRO

i. Enable the LRO function for a NIC.
ethtool -K <NIC name> lro on

ii. Check whether the function is enabled.
ethtool -k <NIC name> | grep large-receive-offload

NOTE

In addition to optimizing the preceding parameters, you need to bind the NIC software interrupts to the cores. A worked example follows this note.

1. Disable the irqbalance service.

2. Query the NUMA node to which the NIC belongs:
cat /sys/class/net/<Network port name>/device/numa_node

3. Query the CPU cores that correspond to the NUMA node.
lscpu

4. Query the interrupt IDs corresponding to the NIC.
cat /proc/interrupts | grep <Network port name> | awk -F ':' '{print $1}'

5. Bind the software interrupts to the cores corresponding to the NUMA node.
echo <core number> > /proc/irq/<Interrupt ID>/smp_affinity_list
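A worked example of these binding steps is shown below. The port name enps0f0, the NUMA node number, and the core range 64-95 are assumptions for illustration; substitute the values reported on your system.

systemctl stop irqbalance && systemctl disable irqbalance
cat /sys/class/net/enps0f0/device/numa_node        # assume this returns 2
lscpu | grep "NUMA node2"                          # assume: NUMA node2 CPU(s): 64-95
# Bind every interrupt of the port to the cores of that NUMA node
for irq in $(grep enps0f0 /proc/interrupts | awk -F ':' '{print $1}'); do
    echo 64-95 > /proc/irq/${irq}/smp_affinity_list
done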

2.2.3 Ceph Tuning

Modifying Ceph Configuration

● Purpose: Adjust the Ceph configuration to maximize system resource usage.

● Procedure: You can edit the /etc/ceph/ceph.conf file to modify all Ceph configuration parameters. For example, you can add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file to change the default number of copies to 4, and run the systemctl restart ceph.target command to restart the Ceph daemon process for the change to take effect. The preceding operations take effect only on the current Ceph node. You need to modify the ceph.conf file on all Ceph nodes and restart the Ceph daemon process for the modification to take effect on the entire Ceph cluster. Table 2-10 describes the Ceph optimization items.

Table 2-10 Ceph parameter configuration

[global]

cluster_network
Description: You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network.
Recommended value: 192.168.4.0/24. You can set this parameter as required as long as it is different from the public network segment.


public_network
Recommended value: 192.168.3.0/24. You can set this parameter as required as long as it is different from the cluster network segment.

osd_pool_default_size
Description: Number of copies.
Recommended value: 3

osd_memory_target
Description: Size of memory that each OSD process is allowed to obtain.
Recommended value: 4294967296
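For reference, the [global] settings in Table 2-10 correspond to a ceph.conf fragment like the following sketch; the network segments are the example values used in this guide, and the Ceph daemons must be restarted on every node after the file is edited.

# /etc/ceph/ceph.conf excerpt reflecting Table 2-10
[global]
public_network = 192.168.3.0/24
cluster_network = 192.168.4.0/24
osd_pool_default_size = 3
osd_memory_target = 4294967296

# Apply on each node after editing
systemctl restart ceph.target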

For details about how to optimize other parameters, see Table 2-11.

Table 2-11 Other parameter configuration

[global]

osd_pool_default_min_size: Minimum number of I/O copies that the PG can receive. If a PG is in the degraded state, its I/O capability is not affected. Default value: 0. Recommended value: 1.

cluster_network: You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network. This parameter has no default value. Recommended value: 192.168.4.0/24.

osd_memory_target: Size of memory that each OSD process is allowed to obtain. Default value: 4294967296. Recommended value: 4294967296.

[mon]

mon_clock_drift_allowed: Clock drift allowed between MONs. Default value: 0.05. Recommended value: 1.

mon_osd_min_down_reporters: Minimum down OSD quantity that triggers a report to the MONs. Default value: 2. Recommended value: 13.


mon_osd_down_out_interval: Number of seconds that Ceph waits before an OSD is marked as down or out. Default value: 600. Recommended value: 600.

[OSD]

osd_journal_size: OSD journal size. Default value: 5120. Recommended value: 20000.

osd_max_write_size: Maximum size (in MB) of data that can be written by an OSD at a time. Default value: 90. Recommended value: 512.

osd_client_message_size_cap: Maximum size (in bytes) of data that can be stored in the memory by the clients. Default value: 100. Recommended value: 2147483648.

osd_deep_scrub_stride: Number of bytes that can be read during deep scrubbing. Default value: 524288. Recommended value: 131072.

osd_map_cache_size: Size of the cache (in MB) that stores the OSD map. Default value: 50. Recommended value: 1024.

osd_recovery_op_priority: Restoration priority. The value ranges from 1 to 63. A larger value indicates higher resource usage. Default value: 3. Recommended value: 2.

osd_recovery_max_active: Number of active restoration requests in the same period. Default value: 3. Recommended value: 10.

osd_max_backfills: Maximum number of backfills allowed by an OSD. Default value: 1. Recommended value: 4.

osd_min_pg_log_entries: Minimum number of reserved PG logs. Default value: 3000. Recommended value: 30000.

osd_max_pg_log_entries: Maximum number of reserved PG logs. Default value: 3000. Recommended value: 100000.

osd_mon_heartbeat_interval: Interval (in seconds) for an OSD to ping a MON. Default value: 30. Recommended value: 40.


ms_dispatch_throttle_bytes: Maximum number of messages to be dispatched. Default value: 104857600. Recommended value: 1048576000.

objecter_inflight_ops: Allowed maximum number of unsent I/O requests. This parameter is used for client traffic control. If the number of unsent I/O requests exceeds the threshold, the application I/O is blocked. The value 0 indicates that the number of unsent I/O requests is not limited. Default value: 1024. Recommended value: 819200.

osd_op_log_threshold: Number of operation logs to be displayed at a time. Default value: 5. Recommended value: 50.

osd_crush_chooseleaf_type: Bucket type when the CRUSH rule uses chooseleaf. Default value: 1. Recommended value: 0.

journal_max_write_bytes: Maximum number of journal bytes that can be written at a time. Default value: 10485760. Recommended value: 1073714824.

journal_max_write_entries: Maximum number of journal records that can be written at a time. Default value: 100. Recommended value: 10000.

[Client]

rbd_cache: RBD cache. Default value: True. Recommended value: True.

rbd_cache_size: RBD cache size (in bytes). Default value: 33554432. Recommended value: 335544320.

rbd_cache_max_dirty: Maximum number of dirty bytes allowed when the cache is set to the writeback mode. If the value is 0, the cache is set to the writethrough mode. Default value: 25165824. Recommended value: 134217728.


rbd_cache_max_dirty_age: Duration (in seconds) for which the dirty data is stored in the cache before being flushed to the drives. Default value: 1. Recommended value: 30.

rbd_cache_writethrough_until_flush: This parameter is used for compatibility with virtio drivers earlier than linux-2.6.32. It prevents the situation that data is written back when no flush request is sent. After this parameter is set, librbd processes I/Os in writethrough mode, and switches to writeback mode only after the first flush request is received. Default value: True. Recommended value: False.

rbd_cache_max_dirty_object: Maximum number of objects. The default value is 0, which indicates that the number is calculated based on the RBD cache size. By default, librbd logically splits the drive image into chunks of 4 MB, and each chunk is abstracted as an object that librbd manages in the cache. You can increase the value of this parameter to improve the performance. Default value: 0. Recommended value: 2.

rbd_cache_target_dirty: Dirty data size that triggers writeback. The value cannot exceed the value of rbd_cache_max_dirty. Default value: 16777216. Recommended value: 235544320.

Optimizing the PG Distribution

● Purpose: Adjust the number of PGs on each OSD to balance the load on each OSD.


● Procedure: By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs of an existing storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool. The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. You can run the ceph balancer on or ceph balancer off command to enable or disable it. Table 2-12 describes the PG distribution parameters.

Table 2-12 PG distribution parameters

pg_num
Description: Total PGs = (Total number of OSDs x 100) / max_replication_count. Round up the result to the nearest integer power of 2.
Default value: 8. Symptom: A warning is displayed if the number of PGs is insufficient.
Suggestion: Calculate the value based on the formula.

pgp_num
Description: Set the number of PGPs to be the same as that of PGs.
Default value: 8. Symptom: It is recommended that the number of PGPs be the same as the number of PGs.
Suggestion: Calculate the value based on the formula.

ceph_balancer_mode
Description: Enable the balancer plug-in and set the plug-in mode to upmap.
Default value: none. Symptom: If the number of PGs is unbalanced, some OSDs may be overloaded and become bottlenecks.
Recommended value: upmap


NOTE

● The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.

● Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.

● The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs corresponding to each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.

A worked example of the PG calculation and the balancer commands follows this note.
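As a worked example for the cluster described in this guide (3 nodes x 12 OSDs = 36 OSDs, 3 copies): 36 x 100 / 3 = 1200, which rounds up to 2048 PGs. The commands below use a hypothetical pool name vol01.

ceph osd pool create vol01 2048 2048      # pg_num and pgp_num
ceph osd pool get vol01 pg_num            # verify the setting
ceph balancer mode upmap
ceph balancer on
ceph balancer eval                        # check the distribution score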

Binding OSDs to CPU Cores

● Purpose: Bind each OSD process to a fixed CPU core.

● Procedure: Add osd_numa_node = <NUM> to the /etc/ceph/ceph.conf file. Table 2-13 describes the optimization items.

Table 2-13 OSD core binding parameters

[osd.n]

osd_numa_node
Description: Bind the osd.n daemon process to a specified idle NUMA node, which is a node other than the nodes that process the NIC software interrupts.
This parameter has no default value. Symptom: If the CPU of each OSD process is the same as that of the NIC interrupts, some CPUs may be overloaded.
Suggestion: To balance the CPU load pressure, avoid running each OSD process and the NIC interrupt process (or other processes with high CPU usage) on the same NUMA node.


NOTE

● The Ceph OSD daemon process and the NIC software interrupt process must run on different NUMA nodes. Otherwise, CPU bottlenecks may occur when the network load is heavy. By default, Ceph evenly allocates OSD processes to all CPU cores. You can add the osd_numa_node parameter to the ceph.conf file to avoid running each OSD process and the NIC interrupt process (or other processes with high CPU usage) on the same NUMA node.

● Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU cores of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the CPU cores bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC. For example, run the cat /sys/class/net/<Port name>/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 to prevent the OSDs and the NIC software interrupts from using the same CPU cores.
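A sketch of the resulting configuration, assuming the NIC belongs to NUMA node 2 and two OSDs are therefore pinned to nodes 0 and 1 (the port name and node numbers are illustrative):

cat /sys/class/net/enps0f0/device/numa_node    # assume this returns 2

# /etc/ceph/ceph.conf excerpt: per-OSD NUMA binding
[osd.0]
osd_numa_node = 0
[osd.1]
osd_numa_node = 1

# Restart the OSDs for the binding to take effect
systemctl restart ceph-osd.target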

Optimizing Compression Algorithm Configuration Parameters

● Purpose: Adjust the compression algorithm configuration parameters to optimize the performance of the compression algorithm.

● Procedure: The default value of bluestore_min_alloc_size_hdd for Ceph is 32 KB. The value of this parameter affects the size of the final data obtained after the compression algorithm is run. Set this parameter to a smaller value to maximize the compression rate of the compression algorithm. By default, Ceph uses five threads to process I/O requests in an OSD process. After the compression algorithm is enabled, the number of threads can become a performance bottleneck. Increase the number of threads to maximize the performance of the compression algorithm. The following table describes the compression-related parameters:

bluestore_min_alloc_size_hdd
Description: Minimum size of objects allocated to the HDD data disks in the BlueStore storage engine.
Default value: 32768. Recommended value: 8192.

osd_op_num_shards_hdd
Description: Number of shards for an HDD data disk in an OSD process.
Default value: 5. Recommended value: 12.

osd_op_num_threads_per_shard_hdd
Description: Average number of threads of an OSD process for each HDD data disk shard.
Default value: 1. Recommended value: 2.
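These three values go into the [osd] section of ceph.conf; a minimal sketch follows. Note that bluestore_min_alloc_size_hdd is applied when an OSD is created, so it only affects OSDs deployed after the change.

# /etc/ceph/ceph.conf excerpt for the compression-related tuning
[osd]
bluestore_min_alloc_size_hdd = 8192
osd_op_num_shards_hdd = 12
osd_op_num_threads_per_shard_hdd = 2

# Restart the OSDs (or redeploy them for the allocation size to apply)
systemctl restart ceph-osd.target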


Enabling Bcache

Bcache is a block layer cache of the Linux kernel. It uses SSDs as the cache of HDDs for acceleration. To enable the Bcache kernel module, you need to recompile the kernel. For details, see the Bcache User Guide (CentOS 7.6).

Using the I/O Passthrough Tool

The I/O passthrough tool is a process optimization tool for balanced scenarios of the Ceph cluster. It can automatically detect and optimize OSDs in the Ceph cluster. For details on how to use this tool, see the I/O Passthrough Tool User Guide.

2.2.4 KAE zlib Compression Tuning

● Purpose: Optimize zlib compression to maximize the CPU capability of processing OSDs and maximize the hardware performance.

● Procedure: zlib compression is processed by the KAE.

Preparing the Environment

NOTE
Before installing the accelerator engine, you need to apply for and install a license.
License application guide: https://support.huawei.com/enterprise/zh/doc/EDOC1100068122/b9878159
Installation guide: https://support.huawei.com/enterprise/en/doc/EDOC1100048786/ba20dd15

Download the acceleration engine installation package and developer guide.

Download link: https://github.com/kunpengcompute/KAE/tags

Installing the Acceleration Engine

NOTE
The developer guide describes how to install and use all modules of the accelerator engine. Select an appropriate installation mode based on the developer guide. For details, see Installing the KAE Software Package Using Source Code.

Step 1 Install the acceleration engine according to the developer guide.

Step 2 Install the zlib library.

1. Download KAEzip.

2. Download zlib-1.2.11.tar.gz from the zlib official website and copy it to KAEzip/open_source.

3. Perform the compilation and installation.
cd KAEzip
sh setup.sh install

The zlib library is installed in /usr/local/kaezip.


Step 3 Back up the original library link.
mv /lib64/libz.so.1 /lib64/libz.so.1-bak

Step 4 Replace the zlib software compression algorithm dynamic library.
cd /usr/local/kaezip/lib
cp libz.so.1.2.11 /lib64/
ln -s /lib64/libz.so.1.2.11 /lib64/libz.so.1

NOTE
In the cd /usr/local/kaezip/lib command, /usr/local/kaezip indicates the zlib (KAEzip) installation path. Change it as required.

----End

NOTE
If the Ceph cluster is running before the dynamic library is replaced, run the following command on all storage nodes to restart the OSDs for the change to take effect after the dynamic library is replaced:
systemctl restart ceph-osd.target

Changing the Default Number of Accelerator Queues

NOTE
The default number of queues of the hardware accelerator is 256. To fully utilize the performance of the accelerator, change the number of queues to 512 or 1024.

Step 1 Remove hisi_zip.
rmmod hisi_zip

Step 2 Set the default accelerator queue parameter pf_q_num=512.
vi /etc/modprobe.d/hisi_zip.conf
options hisi_zip uacce_mode=2 pf_q_num=512

Step 3 Load hisi_zip.
modprobe hisi_zip

Step 4 Check the hardware accelerator queues.
cat /sys/class/uacce/hisi_zip-*/attrs/available_instances

If the command output shows the new queue number, the change is successful.

Step 5 Check the dynamic library links. If libwd.so.1 is contained in the command output, the operation is successful.
ldd /lib64/libz.so.1

----End


Adapting Ceph to the Accelerator

NOTE
Currently, the mainline Ceph versions allow configuring the zlib compression mode using the configuration file. The released Ceph versions (up to v15.2.3) adopt the zlib compression mode without the data header and tail. However, the current hardware acceleration library supports only the mode with the data header and tail. Therefore, the Ceph source code needs to be modified to adapt to the Kunpeng hardware acceleration library. For details about the modification method, see the latest patch that has been incorporated into the mainline version: https://github.com/ceph/ceph/pull/34852
The following uses Ceph 14.2.11 as an example to describe how Ceph adapts to the zlib compression engine.

Step 1 Obtain the source code.

Source code download address: https://download.ceph.com/tarballs/

After the source code package is downloaded, save it to the /home directory on the server.

Step 2 Obtain the patch and save it to the /home directory.

https://github.com/kunpengcompute/ceph/releases/download/v14.2.11/ceph-14.2.11-glz.patch

Step 3 Go to the /home directory, decompress the source code package, and enter the directory generated after decompression.
cd /home && tar -zxvf ceph-14.2.11.tar.gz && cd ceph-14.2.11/

Step 4 Apply the patch in the root directory of the source code.
cd /home/ceph-14.2.11
patch -p1 < ceph-14.2.11-glz.patch

Step 5 After modifying the source code, compile Ceph.
● CentOS: See the Ceph 14.2.1 Porting Guide (CentOS 7.6).
● openEuler: See the Ceph 14.2.8 Porting Guide (openEuler 20.03).

Step 6 Install Ceph.

Step 7 Modify the ceph.conf file to configure the zlib compression mode.
vi /etc/ceph/ceph.conf
compressor_zlib_winsize=15

Step 8 Restart the Ceph cluster for the configuration to take effect, and verify the setting:
ceph daemon osd.0 config show | grep compressor_zlib_winsize

----End


2.3 High-Performance Storage

2.3.1 Hardware Tuning

High-Performance Configuration Tuning

● Purpose: Balance the loads of the two CPUs.

● Procedure: Evenly distribute the NVMe SSDs and NICs to the two CPUs.

Hardware Type   Optimization Item   Description

NIC   NUMA resource balancing   For example, you can insert the LOMs into the PCIe slots of CPU 1 and the Mellanox ConnectX-4 NICs into the PCIe slots of CPU 2 to balance the loads of the two CPUs.

Storage   NUMA resource balancing   For example, you can insert six NVMe SSDs into the PCIe slots of CPU 1 and the other six NVMe SSDs into the PCIe slots of CPU 2 to balance the loads of the two CPUs.
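Whether the adapters landed on the intended CPU can be checked from the OS; a quick sketch (the PCIe address is a placeholder to be taken from the lspci output):

lspci -nn | grep -i "Non-Volatile memory"            # list NVMe controllers with their PCIe addresses
lspci -nn | grep -i Ethernet                         # list NICs with their PCIe addresses
cat /sys/bus/pci/devices/<PCIe address>/numa_node    # NUMA node the device is attached to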

2.3.2 System Tuning

Optimizing the OS Configuration

● Purpose: Adjust the system configuration to maximize the hardware performance.

● Procedure: Table 2-14 lists the optimization items.


Table 2-14 OS configuration parameters

vm.swappiness
Description: The swap partition is the virtual memory of the system. Do not use the swap partition because it will deteriorate system performance.
Default value: 60. Symptom: The performance deteriorates significantly when the swap partition is used.
Suggestion: Disable the swap partition and set this parameter to 0.
Configuration method: sudo sysctl vm.swappiness=0

MTU
Description: Maximum size of a data packet that can pass through a NIC. After the value is increased, the number of network packets can be reduced and the efficiency can be improved.
Default value: 1500 bytes. Symptom: Run the ip addr command to view the value.
Suggestion: Set the maximum size of a data packet that can pass through a NIC to 9000 bytes.
Configuration method:
1. Run vi /etc/sysconfig/network-scripts/ifcfg-${Interface} and add MTU="9000". (${Interface} indicates the network port name.)
2. After the configuration is complete, restart the network service: service network restart


pid_max
Description: The default value of pid_max is 32768, which is sufficient in normal cases. However, when heavy workloads are being processed, this value is insufficient and may cause memory allocation failure.
Default value: 32768. Symptom: Run the cat /proc/sys/kernel/pid_max command to view the value.
Suggestion: Set the maximum number of threads that can be generated in the system to 4194303.
Configuration method: echo 4194303 > /proc/sys/kernel/pid_max

file_max
Description: Maximum number of files that can be opened by all processes in the system. In addition, some programs can call the setrlimit interface to set the limit on each process. If the system generates a large number of errors indicating that file handles are used up, increase the value of this parameter.
Default value: 13291808. Symptom: Run the cat /proc/sys/fs/file-max command to view the value.
Suggestion: Set the maximum number of files that can be opened by all processes in the system to the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.
Configuration method: echo ${file-max} > /proc/sys/fs/file-max, where ${file-max} is the value displayed by the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command.


read_ahead
Description: Linux readahead means that the Linux kernel prefetches a certain area of the specified file and loads it into the page cache, so that subsequent access to that area does not block on page faults. Reading data from memory is much faster than reading data from drives. Therefore, the readahead feature can effectively reduce the number of drive seeks and the I/O waiting time of the applications. It is one of the important methods for optimizing the drive read I/O performance.
Default value: 128 KB. Symptom: Run /sbin/blockdev --getra /dev/sdb to view the value.
Suggestion: Change the value to 8192 KB. Improve the drive read efficiency by pre-reading and recording the data to random access memory (RAM).
Configuration method: /sbin/blockdev --setra 8192 /dev/sdb (/dev/sdb is used as an example; modify this parameter for all data drives.)

Kunpeng BoostKit for SDSTuning Guide 2 Ceph Block Storage Tuning Guide

Issue 10 (2021-09-13) Copyright © Huawei Technologies Co., Ltd. 31

Page 36: Tuning Guides - HUAWEI CLOUD

Parameter Description Suggestion ConfigurationMethod

I/O_Scheduler

The Linux I/Oscheduler is acomponent ofthe Linux kernel.You can adjustthe scheduler tooptimize systemperformance.

Default value:CFQSymptom: TheLinux I/Oscheduler needsto be configuredbased ondifferent storagedevices for theoptimal systemperformance.Suggestion: Setthe I/Oscheduling policyto deadline forHDDs and noopfor SSDs.

Run the followingcommand:echo deadline > /sys/block/sdb/queue/scheduler

NOTE/dev/sdb is used asan example. Youneed to modify thisparameter for alldata drives.

nr_requests If the Linuxsystem receives alarge number ofread requests,the defaultnumber ofrequest queuesmay beinsufficient. Todeal with thisproblem, you candynamicallyadjust thedefault numberof requestqueues inthe /sys/block/hda/queue/nr_requests file.

Default value:128Symptom:Increase the drivethroughput byadjusting thenr_requestsparameter.Suggestion: Setthe number ofdrive requestqueues to 512.

Run the followingcommand:echo 512 > /sys/block/sdb/queue/nr_requests

NOTE/dev/sdb is used asan example. Youneed to modify thisparameter for alldata drives.
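The echo and sysctl commands above take effect immediately but do not survive a reboot. As a minimal sketch (not part of the original procedure; persisting the settings this way is an assumption about your environment), the kernel parameters can be kept in /etc/sysctl.conf, while the per-drive settings need to be reapplied at boot, for example from a startup script:

# Persist kernel-level settings.
cat >> /etc/sysctl.conf << 'EOF'
vm.swappiness = 0
kernel.pid_max = 4194303
EOF
# file-max follows the MemTotal value (in kB) suggested above.
echo "fs.file-max = $(grep MemTotal /proc/meminfo | awk '{print $2}')" >> /etc/sysctl.conf
sysctl -p

# Block-device settings reset at reboot; reapply them for every data drive,
# for example from /etc/rc.local (here /dev/sdb is only an example).
/sbin/blockdev --setra 8192 /dev/sdb
echo deadline > /sys/block/sdb/queue/scheduler
echo 512 > /sys/block/sdb/queue/nr_requests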

NUMA Affinity Tuning
● Purpose
Evenly allocate network and storage resources to NUMA nodes.
● Procedure
In this example, 12 NVMe SSDs and four network ports are evenly allocated to four NUMA nodes.


The NVMe SSD numbers range from 0 to 11, and the network port names are enps0f0, enps0f1, enps0f2, and enps0f3.
for i in {0..11}; do echo `expr ${i} / 3` > /sys/class/block/nvme${i}n1/device/device/numa_node; done
for j in {0..3}; do echo ${j} > /sys/class/net/enps0f${j}/device/numa_node; done
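To confirm that the assignment took effect, the same sysfs files can be read back. This is a small verification sketch under the same device-naming assumptions as above (nvme0n1 to nvme11n1 and enps0f0 to enps0f3):

for i in {0..11}; do echo -n "nvme${i}n1: "; cat /sys/class/block/nvme${i}n1/device/device/numa_node; done
for j in {0..3}; do echo -n "enps0f${j}: "; cat /sys/class/net/enps0f${j}/device/numa_node; done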

2.3.3 Ceph Tuning

Ceph Configuration Tuning
● Purpose
Adjust the Ceph configuration items to fully utilize the hardware performance of the system.
● Procedure
You can edit the /etc/ceph/ceph.conf file to modify all Ceph configuration parameters.
For example, to change the number of copies to 4, add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file and run the systemctl restart ceph.target command to restart the Ceph daemon processes for the change to take effect.
The preceding operations take effect only on the current Ceph node. You need to modify the ceph.conf file on all Ceph nodes and restart the Ceph daemon processes for the modification to take effect on the entire Ceph cluster.

Table 2-15 lists the optimization items.

Table 2-15 Ceph parameter configuration

[global]
osd_pool_default_min_size — Minimum number of I/O copies that a PG can receive. If a PG is in the degraded state, its I/O capability is not affected. Default value: 0. Suggestion: set this parameter to 1.
cluster_network — You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network. Default value: none. Suggestion: set this parameter to 192.168.4.0/24.
osd_pool_default_size — Number of copies. Default value: 3. Suggestion: set this parameter to 3.
mon_max_pg_per_osd — PG alarm threshold. You can increase the value for better performance. Default value: 250. Suggestion: set this parameter to 3000.
mon_max_pool_pg_num — PG alarm threshold. You can increase the value for better performance. Default value: 65536. Suggestion: set this parameter to 300000.
debug_none, debug_lockdep, debug_context, debug_crush, debug_mds, debug_mds_balancer, debug_mds_locker, debug_mds_log, debug_mds_log_expire, debug_mds_migrator, debug_buffer, debug_timer, debug_filer, debug_striper, debug_objecter, debug_rados, debug_rbd, debug_rbd_mirror, debug_rbd_replay, debug_journaler, debug_objectcacher, debug_client, debug_osd, debug_optracker, debug_objclass, debug_filestore, debug_journal, debug_ms, debug_mon, debug_monc, debug_paxos, debug_tp, debug_auth, debug_crypto, debug_finisher, debug_reserver, debug_heartbeatmap, debug_perfcounter, debug_rgw, debug_civetweb, debug_javaclient, debug_asok, debug_throttle, debug_refs, debug_xio, debug_compressor, debug_bluestore, debug_bluefs, debug_bdev, debug_kstore, debug_rocksdb, debug_leveldb, debug_memdb, debug_kinetic, debug_fuse, debug_mgr, debug_mgrc, debug_dpdk, debug_eventtrace — Disable the debugging function to reduce the log printing overheads. Suggestion: set each of these parameters to 0/0.
throttler_perf_counter — This function is enabled by default. You can check whether the throttling threshold is a bottleneck. After the optimal performance is obtained, you are advised to disable the tracker because it affects performance. Default value: True. Suggestion: set this parameter to False.
ms_dispatch_throttle_bytes — Maximum number of messages to be scheduled. You are advised to increase the value to improve the message processing efficiency. Default value: 104857600. Suggestion: set this parameter to 2097152000.
ms_bind_before_connect — Message queue binding, which ensures that traffic of multiple network ports is balanced. Default value: False. Suggestion: set this parameter to True.

[client]
rbd_cache — Disable the client cache. After the function is disabled, the RBD cache is always in writethrough mode. Default value: True. Suggestion: set this parameter to False.

[osd]
osd_max_write_size — Maximum size (in MB) of data that can be written by an OSD at a time. Default value: 90. Suggestion: set this parameter to 256.
osd_client_message_size_cap — Maximum size (in bytes) of data that can be stored in the memory by the clients. Default value: 524288000. Suggestion: set this parameter to 1073741824.
osd_map_cache_size — Size of the cache (in MB) that stores the OSD map. Default value: 50. Suggestion: set this parameter to 1024.
bluestore_rocksdb_options — RocksDB configuration parameter. Default value: compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2. Suggestion: compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=32,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=16,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,max_bytes_for_level_base=6GB,compaction_threads=8,flusher_threads=4,compaction_readahead_size=2MB.
bluestore_csum_type — Checksum type. Default value: crc32c. Suggestion: none (do not specify a checksum type).
mon_osd_full_ratio — Percentage of used drive space at which an OSD is considered full. When the data volume exceeds this percentage, all read and write operations are stopped until the drive space is expanded or data is cleared so that the percentage of used drive space falls below the value. Default value: 0.95. Suggestion: set this parameter to 0.97.
mon_osd_nearfull_ratio — Percentage of used drive space at which an OSD is regarded as almost full. When the data volume exceeds this percentage, an alarm is generated indicating that the space is about to be used up. Default value: 0.85. Suggestion: set this parameter to 0.95.
osd_min_pg_log_entries — Lower limit of the number of PG logs. Default value: 3000. Suggestion: set this parameter to 10.
osd_max_pg_log_entries — Upper limit of the number of PG logs. Default value: 3000. Suggestion: set this parameter to 10.
bluestore_cache_meta_ratio — Ratio of the BlueStore cache allocated to metadata. Default value: 0.4. Suggestion: set this parameter to 0.8.
bluestore_cache_kv_ratio — Ratio of the BlueStore cache allocated to key/value data. Default value: 0.4. Suggestion: set this parameter to 0.2.
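As described above, these settings are applied by editing /etc/ceph/ceph.conf on every node and restarting ceph.target. For illustration only, a fragment carrying a few of the recommended values might look like the following; the section layout and the network segment are assumptions based on this document's example environment, not a mandated configuration:

[global]
cluster_network = 192.168.4.0/24
osd_pool_default_size = 3
osd_pool_default_min_size = 1
mon_max_pg_per_osd = 3000
mon_max_pool_pg_num = 300000
ms_dispatch_throttle_bytes = 2097152000
ms_bind_before_connect = true
throttler_perf_counter = false
debug_osd = 0/0

[client]
rbd_cache = false

[osd]
osd_max_write_size = 256
osd_client_message_size_cap = 1073741824
osd_map_cache_size = 1024

After editing the file, run systemctl restart ceph.target, and repeat the edit and restart on every Ceph node.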

Optimizing the PG Distribution
● Purpose
Adjust the number of PGs on each OSD to balance the load on each OSD.
● Procedure
By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs of an existing storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool.
The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. You can run the ceph balancer on or ceph balancer off command to enable or disable it.
Table 2-16 describes the PG distribution parameters.


Table 2-16 PG distribution parameters

pg_num — Total PGs = (Total_number_of_OSD x 100) / max_replication_count; round the result up to the nearest integer power of 2. Default value: 8. Symptom: a warning is displayed if the number of PGs is insufficient. Suggestion: calculate the value based on the formula.
pgp_num — Set the number of PGPs to be the same as the number of PGs. Default value: 8. Symptom: it is recommended that the number of PGPs be the same as the number of PGs. Suggestion: calculate the value based on the formula.
ceph_balancer_mode — Enable the balancer plug-in and set the plug-in mode to upmap. Default value: none. Symptom: if the PG distribution is unbalanced, some OSDs may be overloaded and become bottlenecks. Recommended value: upmap.
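As a worked example of the formula (the pool name and cluster size are assumptions for illustration; the example clusters in this document use 12 OSDs on each of three nodes): with 36 OSDs and 3 replicas, 36 x 100 / 3 = 1200, and the next power of 2 is 2048, so pg_num and pgp_num would both be set to 2048:

ceph osd pool create volumes 2048 2048     # "volumes" is a hypothetical pool name
ceph osd pool get volumes pg_num           # verify the PG count
ceph balancer mode upmap
ceph balancer on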

NOTE
● The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.
● Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.
● The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs corresponding to each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.

Binding OSDs to CPU Cores
● Purpose
Bind each OSD process to a fixed CPU core.
● Procedure


Add osd_numa_node = <NUM> to the /etc/ceph/ceph.conf file. Table 2-17 describes the optimization items.

Table 2-17 OSD core binding parameters

[osd.n]
osd_numa_node — Binds the osd.n daemon process to a specified idle NUMA node, that is, a node other than the nodes that process NIC software interrupts. This parameter has no default value. Symptom: if an OSD process and the NIC interrupts share the same CPUs, some CPUs may be overloaded. Suggestion: to balance the CPU load, avoid running an OSD process and the NIC interrupt processing (or other processes with high CPU usage) on the same NUMA node.

NOTE
● The Ceph OSD daemon process and the NIC software interrupt processing must run on different NUMA nodes. Otherwise, CPU bottlenecks may occur when the network load is heavy. By default, Ceph evenly allocates OSD processes to all CPU cores. You can add the osd_numa_node parameter to the ceph.conf file to avoid running an OSD process and the NIC interrupt processing (or other processes with high CPU usage) on the same NUMA node.
● Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU cores of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the CPU cores bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC. For example, run the cat /sys/class/net/PortName/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 to prevent the OSD and the NIC software interrupts from using the same CPU cores.
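As a brief sketch of the ceph.conf syntax described above (the OSD IDs and the chosen NUMA nodes are assumptions for illustration; pick idle nodes on your own servers), pinning three OSDs away from a NIC that sits on NUMA node 2 could look like this:

# /etc/ceph/ceph.conf fragment; restart ceph.target afterwards
[osd.0]
osd_numa_node = 0
[osd.1]
osd_numa_node = 0
[osd.2]
osd_numa_node = 1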


3 Ceph Object Storage Tuning Guide

3.1 Introduction

3.2 Cold Storage

3.3 General-Purpose Storage

3.4 High-Performance Storage

3.1 Introduction

3.1.1 Overview

Ceph
Ceph is a distributed, scalable, reliable, and high-performance storage system platform that supports storage interfaces including block devices, file systems, and object gateways. The optimization methods described in this document include hardware optimization and software configuration optimization; software code optimization is not involved. By adjusting the system and Ceph configuration parameters, Ceph can fully utilize the hardware performance of the system. Ceph Placement Group (PG) distribution optimization and object storage daemon (OSD) core binding aim to balance drive loads and prevent any OSD from becoming a bottleneck. In addition, in general-purpose storage scenarios, using NVMe SSDs as Bcache can also improve performance. Figure 3-1 shows the Ceph architecture.


Figure 3-1 Ceph architecture

Table 3-1 describes the Ceph modules and components.

Table 3-1 Module functions

RADOS — Reliable Autonomic Distributed Object Store (RADOS) is the heart of a Ceph storage cluster. Everything in Ceph is stored by RADOS in the form of objects, irrespective of data type. The RADOS layer ensures data consistency and reliability through data replication, fault detection and recovery, and data recovery across cluster nodes.
OSD — Object storage daemons (OSDs) store the actual user data. Every OSD is usually bound to one physical drive. The OSDs handle the read/write requests from clients.
MON — The monitor (MON) is the most important component in a Ceph cluster. It manages the Ceph cluster and maintains the status of the entire cluster. The MON ensures that related components of a cluster can be synchronized at the same time. It functions as the leader of the cluster and is responsible for collecting, updating, and publishing cluster information. To avoid single points of failure (SPOFs), multiple MONs are deployed in a Ceph environment, and they must handle the collaboration between them.
MGR — The manager (MGR) is a monitoring system that provides collection, storage, analysis (including alarming), and visualization functions. It makes certain cluster parameters available for external systems.
Librados — Librados is a method that simplifies access to RADOS. It currently supports the PHP, Ruby, Java, Python, C, and C++ programming languages. It provides a local interface to RADOS, the Ceph storage cluster, and is the base component of other services such as the RADOS block device (RBD) and RADOS gateway (RGW). In addition, it provides the Portable Operating System Interface (POSIX) for the Ceph file system (CephFS). The Librados API can be used to directly access RADOS, enabling developers to create their own interfaces for accessing the Ceph cluster storage.
RBD — The RADOS block device (RBD) is the Ceph block device that provides block storage for external systems. It can be mapped, formatted, and mounted like a drive to a server.
RGW — The RADOS gateway (RGW) is a Ceph object gateway that provides RESTful APIs compatible with S3 and Swift. The RGW also supports multi-tenancy and the OpenStack Identity service (Keystone).
MDS — The Ceph Metadata Server (MDS) tracks the file hierarchy and stores metadata used only for CephFS. The RBD and RGW do not require metadata. The MDS does not directly provide data services for clients.
CephFS — The CephFS provides a POSIX-compatible distributed file system of any size. It depends on the Ceph MDS to track the file hierarchy, namely the metadata.

3.1.2 Environment

Physical Networking
The physical environment of the Ceph cluster contains two network layers and three nodes. In the physical environment, the MON, MGR, MDS, and OSD nodes are deployed together. At the network layer, the public network is separated from the cluster network. The two networks use 25GE optical ports for communication.

Figure 3-2 shows the physical network.


Figure 3-2 Physical networking

Hardware Configuration
Table 3-2 shows the Ceph hardware configuration.

Table 3-2 Hardware configuration

Server: TaiShan 200 server (model 2280)
Processor: Kunpeng 920 5230 processor
Cores: 2 x 32-core
CPU frequency: 2600 MHz
Memory capacity: 12 x 16 GB
Memory frequency: 2666 MHz (8 Micron 2R memory modules)
NIC: IN200 NIC (4 x 25GE)
Drives: System drives: RAID 1 (2 x 960 GB SATA SSDs). Data drives of general-purpose storage: JBOD enabled in RAID mode (12 x 4 TB SATA HDDs).
NVMe SSD: Acceleration drive of general-purpose storage: 1 x 3.2 TB ES3600P V5 NVMe SSD. Data drives of high-performance storage: 12 x 3.2 TB ES3600P V5 NVMe SSDs.
RAID controller card: Avago SAS 3508

Software Versions
Table 3-3 lists the required software versions.

Table 3-3 Software versions

OS: CentOS Linux release 7.6.1810; openEuler 20.03 LTS SP1
Ceph: 14.2.1 Nautilus
ceph-deploy: 2.0.1
CosBench: 0.4.2.c4

Node Information
Table 3-4 describes the IP network segment planning of the hosts.

Table 3-4 Node information

Host Type       Host Name   Public Network Segment   Cluster Network Segment
OSD/MON node    Node 1      192.168.3.0/24           192.168.4.0/24
OSD/MGR node    Node 2      192.168.3.0/24           192.168.4.0/24
OSD/MDS node    Node 3      192.168.3.0/24           192.168.4.0/24

Component Deployment
Table 3-5 describes the deployment of service components in the Ceph cluster.

Table 3-5 Component deployment

Physical Machine Name   OSD       MON     MGR

Node 1 12 OSDs 1 MON 1 MGR

Node 2 12 OSDs 1 MON 1 MGR

Node 3 12 OSDs 1 MON 1 MGR

Cluster Check
Run the ceph health command to check the cluster health status. If HEALTH_OK is displayed, the cluster is running properly.

3.1.3 Tuning Guidelines and Process Flow
The object storage tuning varies with the hardware configuration.
● Cold storage: All data drives are hard disk drives (HDDs). That is, the DB/WAL partitions and metadata storage pools use HDDs.
● General-purpose storage: HDDs are used as data drives, and SSDs are used for the DB/WAL partitions and metadata storage pools.
● High-performance storage: All data drives are SSDs.
Perform the tuning based on your hardware configuration.

Tuning Guidelines
Performance optimization must comply with the following principles:
● When analyzing performance, analyze system resource bottlenecks from multiple aspects. For example, insufficient memory capacity may cause the CPU to be occupied by memory scheduling tasks and the CPU usage to reach 100%.
● Adjust only one performance parameter at a time.
● The analysis tool may consume system resources and aggravate certain system resource bottlenecks. Therefore, the impact on applications must be avoided or minimized.

Tuning Process Flow
The tuning analysis flow is as follows:
1. In many cases, pressure test traffic is not completely sent to the backend (server). For example, a protection policy may be triggered on network access layer services such as Server Load Balancing (SLB), Web Application Firewall (WAF), High Defense IP, and even Content Delivery Network (CDN)/site acceleration in a cloud-based architecture. This occurs because the specifications, such as bandwidth, maximum number of connections, and number of new connections, are limited, or the pressure test shows the features of Challenge Collapsar (CC) and Distributed Denial of Service (DDoS) attacks. As a result, the pressure test results do not meet expectations.
2. Check whether the key indicators meet the requirements. If not, locate the fault. The fault may be caused by the servers (in most cases) or the clients (in a few cases).
3. If the problem is caused by the servers, focus on the hardware indicators such as the CPU, memory, drive I/O, and network I/O. Locate the fault and perform further analysis on the abnormal hardware indicator.
4. If all hardware indicators are normal, check the middleware indicators such as the thread pool, connection pool, and GC indicators. Perform further analysis based on the abnormal middleware indicator.
5. If all middleware indicators are normal, check the database indicators such as slow query SQL indicators, hit ratio, locks, and parameter settings.
6. If the preceding indicators are normal, the algorithm, buffer, cache, synchronization, or asynchronization of the applications may be faulty. Perform further analysis.

Table 3-6 lists the possible bottlenecks.

Table 3-6 Possible bottlenecks

Bottleneck — Description

Hardware/Specifications — Problems of the CPU, memory, and drive I/O. The problems are classified into server hardware bottlenecks and network bottlenecks (network bottlenecks can be ignored in a LAN).
Middleware — Problems of software such as application servers, web servers, and database systems. For example, a bottleneck may be caused if the parameters of the Java Database Connectivity (JDBC) connection pool configured on the WebLogic platform are set improperly.
Applications — Problems related to applications developed by developers. For example, when the system receives a large number of user requests, the following problems may cause low system performance: slow SQL statements and improper Java Virtual Machine (JVM) parameters, container settings, database design, program architecture planning, and program design (insufficient threads for serial processing and request processing, no buffer, no cache, and mismatch between producers and consumers).
OS — Problems related to the OS, such as Windows, UNIX, or Linux. For example, if the physical memory capacity is insufficient and the virtual memory capacity is improper during a performance test, the virtual memory swap efficiency may be greatly reduced, which increases the response time. This bottleneck is caused by the OS.
Network devices — Problems related to devices such as firewalls, dynamic load balancers, and switches. More and more network access products are used in cloud service architectures, including but not limited to the SLB, WAF, High Defense IP, CDN, and site acceleration. For example, if a dynamic load distribution mechanism is set on the dynamic load balancer, the dynamic load balancer automatically sends subsequent transaction requests to low-load servers when the hardware resource usage of a server reaches the limit. If the dynamic load balancer does not function as expected in the test, the problem is a network bottleneck.

General tuning procedure:

Figure 3-3 shows the general tuning procedure.

Figure 3-3 General tuning procedure

3.2 Cold Storage

3.2.1 Hardware Tuning

DIMM Installation Mode Tuning
● Purpose
Populate one DIMM per channel (1DPC) to maximize the memory performance. That is, populate the DIMM 0 slot of each channel.
● Procedure
Preferentially populate the DIMM 0 slots (DIMM 000, 010, 020, 030, 040, 050, 100, 110, 120, 130, 140, and 150). Of the three digits in the DIMM slot number, the first digit indicates the CPU, the second digit indicates the DIMM channel, and the third digit indicates the DIMM. Populate the DIMM slots whose third digit is 0 in ascending order.

3.2.2 System Tuning

Optimizing the Network Performance
● Purpose
This test uses the 25GE Ethernet adapter (Hi1822) with four SFP+ ports. It is used as an example to describe how to optimize the NIC parameters for optimal performance.
● Procedure
The optimization methods include adjusting NIC parameters and interrupt-core binding (binding interrupts to the physical CPU of the NIC). Table 3-7 describes the optimization items.

Table 3-7 NIC parameters

irqbalance — System interrupt balancing service, which automatically allocates NIC software interrupts to idle CPUs. Default value: active. Symptom: when this function is enabled, the system automatically allocates NIC software interrupts to idle CPUs. Suggestion: disable irqbalance (systemctl stop irqbalance) and keep it disabled after the server is restarted (systemctl disable irqbalance).
rx_buff — Aggregation of large network packets requires multiple discontinuous memory pages and causes low memory usage. You can increase the value of this parameter to improve the memory usage. Default value: 2. Symptom: with the default value 2, interrupts consume a large number of CPU resources. Suggestion: load the rx_buff parameter with the value 8 to reduce discontinuous memory and improve memory usage and performance. For details, see the description following the table.
ring_buffer — You can increase the throughput by adjusting the NIC buffer size. Default value: 1024. Symptom: run the ethtool -g <NIC name> command to view the value. Suggestion: change the ring_buffer queue size to 4096. For details, see the description following the table.
lro — Large receive offload. After this function is enabled, multiple small packets are aggregated into one large packet for better efficiency. Default value: off. Symptom: after this function is enabled, the maximum throughput increases significantly. Suggestion: enable the large-receive-offload function to improve the efficiency of sending and receiving packets. For details, see the description following the table.
hinicadm lro -i hinic0 -t <NUM> — Received aggregated packets are sent after the time specified by <NUM> (in microseconds). Default value: 16 microseconds. Symptom: this parameter is used with the LRO function. Suggestion: change the value to 256 microseconds for better efficiency.
hinicadm lro -i hinic0 -n <NUM> — Received aggregated packets are sent after the number of aggregated packets reaches the value specified by <NUM>. Default value: 4. Symptom: this parameter is used with the LRO function. Suggestion: change the value to 32 for better efficiency.

– Adjusting rx_buff
i. Go to the /etc/modprobe.d directory.
cd /etc/modprobe.d
ii. Create the hinic.conf file.
vi /etc/modprobe.d/hinic.conf
Add the following information to the file:
options hinic rx_buff=8
iii. Reload the driver.
rmmod hinic
modprobe hinic
iv. Check whether the value of rx_buff is changed to 8.
cat /sys/bus/pci/drivers/hinic/module/parameters/rx_buff
– Adjusting ring_buffer
i. Change the buffer size from the default value 1024 to 4096.
ethtool -G <NIC name> rx 4096 tx 4096
ii. Check the current buffer size.
ethtool -g <NIC name>
– Enabling LRO
i. Enable the LRO function for a NIC.
ethtool -K <NIC name> lro on
ii. Check whether the function is enabled.
ethtool -k <NIC name> | grep large-receive-offload
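The LRO aggregation timer and packet count suggested in Table 3-7 are set with the hinicadm tool. The exact command form below is inferred from the parameter names in that table and should be treated as a sketch (verify the interface name, shown here as hinic0, and the option syntax with your hinicadm version):

hinicadm lro -i hinic0 -t 256    # flush aggregated packets after 256 microseconds
hinicadm lro -i hinic0 -n 32     # or after 32 aggregated packets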

NOTE
In addition to optimizing the preceding parameters, you need to bind the NIC software interrupts to cores:
1. Disable the irqbalance service.
2. Query the NUMA node to which the NIC belongs:
cat /sys/class/net/<Network port name>/device/numa_node
3. Query the CPU cores that correspond to the NUMA node:
lscpu
4. Query the interrupt IDs corresponding to the NIC:
cat /proc/interrupts | grep <Network port name> | awk -F ':' '{print $1}'
5. Bind the software interrupts to the cores corresponding to the NUMA node:
echo <core number> > /proc/irq/<Interrupt ID>/smp_affinity_list
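The five steps above can be scripted. This is a rough sketch only: the port name (enps0f0) and core list (0-23) are placeholders to replace with the values found in steps 2 and 3.

PORT=enps0f0
CORES=0-23                                   # cores of the NIC's NUMA node, from lscpu
systemctl stop irqbalance
for irq in $(cat /proc/interrupts | grep ${PORT} | awk -F ':' '{print $1}'); do
    echo ${CORES} > /proc/irq/${irq}/smp_affinity_list   # bind each interrupt of the port
done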

Enabling SMMU Passthrough
● Purpose
To maximize the performance of the Kunpeng processor, you are advised to enable SMMU passthrough.
● Procedure
Step 1 Edit the /etc/grub2-efi.cfg file.
vi /etc/grub2-efi.cfg
Step 2 Find the line where vmlinuz-4.14.0-115.el7a.0.1.aarch64 is located in the kernel code, add iommu.passthrough=1 to the end of the line, save the file and exit, and restart the server.
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root e343026a-d245-4812-95ce-6ff999b3571c
else
search --no-floppy --fs-uuid --set=root e343026a-d245-4812-95ce-6ff999b3571c
fi
linux /vmlinuz-4.14.0-115.el7a.0.1.aarch64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap LANG=en_US.UTF-8 iommu.passthrough=1
initrd /initramfs-4.14.0-115.el7a.0.1.aarch64.img

----End


NOTE
This tuning procedure applies only to the Kunpeng computing platform.
4.14.0-115.el7a.0.1.aarch64 is the kernel version of CentOS 7.6. If you use another OS, run the uname -r command to query the current kernel version, and add iommu.passthrough=1 at the end of the line where vmlinuz-<kernel version> is located.
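After the server restarts, a quick way to confirm that the parameter reached the kernel command line is to check /proc/cmdline (a small verification sketch, not part of the original procedure):

grep -o 'iommu.passthrough=1' /proc/cmdline

If the string is printed, SMMU passthrough is active for the current boot.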

Optimizing the OS Configuration
● Purpose

Adjust the system configuration to maximize the hardware performance.

● Procedure

Table 3-8 lists the optimization items.

Table 3-8 OS configuration parameters

vm.swappiness
Description: The swap partition is the virtual memory of the system. Do not use the swap partition because it deteriorates system performance.
Suggestion: Default value: 60. Symptom: the performance deteriorates significantly when the swap partition is used. Suggestion: disable the swap partition by setting this parameter to 0.
Configuration method: Run the following command:
sudo sysctl vm.swappiness=0

MTU
Description: Maximum size of a data packet that can pass through a NIC. Increasing the value reduces the number of network packets and improves efficiency.
Suggestion: Default value: 1500 bytes. Symptom: run the ip addr command to view the value. Suggestion: set the maximum size of a data packet that can pass through a NIC to 9000 bytes.
Configuration method:
1. Edit the interface configuration file and add MTU="9000":
vi /etc/sysconfig/network-scripts/ifcfg-${Interface}
NOTE: ${Interface} indicates the network port name.
2. After the configuration is complete, restart the network service:
service network restart

pid_max
Description: The default value of pid_max is 32768, which is sufficient in normal cases. However, when heavy workloads are being processed, this value is insufficient and may cause memory allocation failures.
Suggestion: Default value: 32768. Symptom: run the cat /proc/sys/kernel/pid_max command to view the value. Suggestion: set the maximum number of threads that can be generated in the system to 4194303.
Configuration method: Run the following command:
echo 4194303 > /proc/sys/kernel/pid_max

file_max
Description: Maximum number of files that can be opened by all processes in the system. In addition, some programs can call the setrlimit interface to set the limit for each process. If the system generates a large number of errors indicating that file handles are used up, increase the value of this parameter.
Suggestion: Default value: 13291808. Symptom: run the cat /proc/sys/fs/file-max command to view the value. Suggestion: set the maximum number of files that can be opened by all processes in the system to the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.
Configuration method: Run the following command:
echo ${file-max} > /proc/sys/fs/file-max
NOTE: ${file-max} is the value displayed after cat /proc/meminfo | grep MemTotal | awk '{print $2}' is run.

read_ahead
Description: Linux readahead means that the Linux kernel prefetches a certain area of the specified file and loads it into the page cache, so that later access to that area does not block on page faults. Reading data from memory is much faster than reading it from drives, so readahead effectively reduces the number of drive seeks and the I/O waiting time of applications. It is one of the important methods for optimizing drive read I/O performance.
Suggestion: Default value: 128 KB. Symptom: run /sbin/blockdev --getra /dev/sdb to view the value. Suggestion: change the value to 8192 to improve drive read efficiency by prefetching data into random access memory (RAM).
Configuration method: Run the following command:
/sbin/blockdev --setra 8192 /dev/sdb
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.

I/O_Scheduler
Description: The Linux I/O scheduler is a component of the Linux kernel. You can adjust the scheduler to optimize system performance.
Suggestion: Default value: CFQ. Symptom: the Linux I/O scheduler needs to be configured based on the storage device type for optimal system performance. Suggestion: set the I/O scheduling policy to deadline for HDDs and noop for SSDs.
Configuration method: Run the following command:
echo deadline > /sys/block/sdb/queue/scheduler
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.

nr_requests
Description: If the Linux system receives a large number of read requests, the default number of request queues may be insufficient. You can dynamically adjust the number of request queues in the /sys/block/hda/queue/nr_requests file.
Suggestion: Default value: 128. Symptom: increase the drive throughput by adjusting the nr_requests parameter. Suggestion: set the number of drive request queues to 512.
Configuration method: Run the following command:
echo 512 > /sys/block/sdb/queue/nr_requests
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.

3.2.3 Ceph Tuning

Tuning Ceph Configuration
● Purpose
Modify the Ceph configuration to maximize system resource utilization.
● Procedure
You can modify the Ceph configuration in the /etc/ceph/ceph.conf file. For example, to change the number of copies to 4, add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file and run the systemctl restart ceph.target command to restart the Ceph daemon processes for the change to take effect.
The setting takes effect only on the current Ceph node. To apply the settings to the entire Ceph cluster, you need to modify the ceph.conf file on each Ceph node and restart the Ceph daemon processes. Table 3-9 describes the Ceph parameters to be modified.

Table 3-9 Ceph parameters

[global]
cluster_network — Configures a network segment different from the public network. This network segment is used for replication and data balancing between OSDs to relieve the pressure on the public network. Suggestion: configure a network segment that is different from the public network segment, for example, 192.168.4.0/24.
public_network — Public network segment. Suggestion: configure a network segment that is different from the cluster network segment, for example, 192.168.3.0/24.

Table 3-10 describes other parameters that can be modified.

Table 3-10 Other parameters

[global]
osd_pool_default_min_size — Minimum number of I/O copies that a PG can receive. If a PG is in the degraded state, its I/O capability is not affected. Default value: 0. Recommended value: 1.
cluster_network — Configures a network segment different from the public network, used for replication and data balancing between OSDs to relieve the pressure on the public network. Recommended value: 192.168.4.0/24.
osd_pool_default_size — Specifies the number of copies. Default value: 3. Recommended value: 3.
osd_memory_target — Specifies the size of memory that each OSD process is allowed to obtain. Default value: 4294967296. Recommended value: 4294967296.

[mon]
mon_clock_drift_allowed — Specifies the clock drift allowed between MONs. Default value: 0.05. Recommended value: 1.
mon_osd_min_down_reporters — Specifies the minimum number of down OSDs that triggers a report to the MONs. Default value: 2. Recommended value: 13.
mon_osd_down_out_interval — Specifies the duration (in seconds) for which Ceph waits before an OSD is marked as down or out. Default value: 600. Recommended value: 600.

[OSD]
osd_journal_size — Specifies the OSD journal size. Default value: 5120. Recommended value: 20000.
osd_max_write_size — Specifies the maximum size (in MB) of data that can be written by an OSD at a time. Default value: 90. Recommended value: 512.
osd_client_message_size_cap — Specifies the maximum size (in bytes) of data that can be stored in the memory by the clients. Default value: 100. Recommended value: 2147483648.
osd_deep_scrub_stride — Specifies the number of bytes that can be read during deep scrubbing. Default value: 524288. Recommended value: 131072.
osd_map_cache_size — Specifies the size of the cache (in MB) that stores the OSD map. Default value: 50. Recommended value: 1024.
osd_recovery_op_priority — Specifies the priority of the recovery operation. The value ranges from 1 to 63. A larger value indicates higher resource usage. Default value: 3. Recommended value: 2.
osd_recovery_max_active — Specifies the maximum number of active recovery requests allowed at the same time. Default value: 3. Recommended value: 10.
osd_max_backfills — Specifies the maximum number of backfills allowed by an OSD. Default value: 1. Recommended value: 4.
osd_min_pg_log_entries — Specifies the maximum number of PGLogs that can be recorded when the PG is normal. Default value: 3000. Recommended value: 30000.
osd_max_pg_log_entries — Specifies the maximum number of PGLogs that can be recorded when the PG is degraded. Default value: 3000. Recommended value: 100000.
osd_mon_heartbeat_interval — Specifies the interval (in seconds) at which an OSD pings a MON. Default value: 30. Recommended value: 40.
ms_dispatch_throttle_bytes — Specifies the maximum number of messages to be dispatched. Default value: 10485760. Recommended value: 1048576000.
objecter_inflight_ops — Specifies the maximum number of unsent I/O requests allowed. This parameter is used for client traffic control. If the number of unsent I/O requests exceeds the threshold, the application I/O is blocked. The value 0 indicates that the number of unsent I/O requests is not limited. Default value: 1024. Recommended value: 819200.
osd_op_log_threshold — Specifies the number of operation logs displayed at a time. Default value: 5. Recommended value: 50.
osd_crush_chooseleaf_type — Specifies the bucket type used when the CRUSH rule uses chooseleaf. Default value: 1. Recommended value: 0.
journal_max_write_bytes — Specifies the maximum number of bytes that can be written to a journal at a time. Default value: 1048560. Recommended value: 1073714824.
journal_max_write_entries — Specifies the maximum number of records that can be written to a journal at a time. Default value: 100. Recommended value: 10000.

[Client]
rbd_cache — Specifies the RBD cache. Default value: True (indicating that the RBD cache is enabled). Recommended value: True.
rbd_cache_size — Specifies the RBD cache size (in bytes). Default value: 33554432. Recommended value: 335544320.
rbd_cache_max_dirty — Specifies the maximum number of dirty bytes allowed when the cache is in writeback mode. If the value is 0, the cache works in writethrough mode. Default value: 25165824. Recommended value: 134217728.
rbd_cache_max_dirty_age — Specifies the duration (in seconds) for which dirty data is kept in the cache before being flushed to the drives. Default value: 1. Recommended value: 30.
rbd_cache_writethrough_until_flush — This parameter ensures compatibility with VirtIO drivers earlier than Linux 2.6.32. It allows data to be written back when no flush request is sent. If this parameter is set to True, librbd processes I/Os in writethrough mode and switches to writeback mode only after the first flush request is received. Default value: True. Recommended value: False.
rbd_cache_max_dirty_object — Specifies the maximum number of objects. The default value 0 indicates that the number is calculated based on the RBD cache size. By default, librbd logically splits the drive image into 4 MB chunks, each abstracted as an object, and manages the cache by object. You can increase the value of this parameter to improve performance. Default value: 0. Recommended value: 2.
rbd_cache_target_dirty — Specifies the dirty data size that triggers writeback. The value cannot exceed the value of rbd_cache_max_dirty. Default value: 16777216. Recommended value: 235544320.
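Tying Table 3-9 and Table 3-10 together, a ceph.conf fragment carrying a few of the recommended values might look like the following. This is an illustrative sketch only; the network segments come from this document's example environment, and every value should be validated against your own workload:

[global]
public_network = 192.168.3.0/24
cluster_network = 192.168.4.0/24
osd_pool_default_size = 3
osd_pool_default_min_size = 1
osd_memory_target = 4294967296

[mon]
mon_clock_drift_allowed = 1
mon_osd_min_down_reporters = 13

[osd]
osd_journal_size = 20000
osd_max_write_size = 512
osd_map_cache_size = 1024
osd_max_backfills = 4

[client]
rbd_cache = true
rbd_cache_size = 335544320
rbd_cache_max_dirty = 134217728
rbd_cache_writethrough_until_flush = false

As described above, copy the edited file to every Ceph node and restart ceph.target on each node.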

Optimizing the PG Distribution
● Purpose
Adjust the number of PGs on each OSD to balance the load on each OSD.
● Procedure
By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs of an existing storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool.
The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. You can run the ceph balancer on or ceph balancer off command to enable or disable it.
Table 3-11 describes the PG distribution parameters.

Table 3-11 PG distribution parameters

pg_num — Total PGs = (Total_number_of_OSD x 100) / max_replication_count; round the result up to the nearest integer power of 2. Default value: 8. Symptom: a warning is displayed if the number of PGs is insufficient. Suggestion: calculate the value based on the formula.
pgp_num — Set the number of PGPs to be the same as the number of PGs. Default value: 8. Symptom: it is recommended that the number of PGPs be the same as the number of PGs. Suggestion: calculate the value based on the formula.
ceph_balancer_mode — Enable the balancer plug-in and set the plug-in mode to upmap. Default value: none. Symptom: if the PG distribution is unbalanced, some OSDs may be overloaded and become bottlenecks. Recommended value: upmap.


NOTE
● The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.
● Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.
● The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs corresponding to each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.

Binding OSDs and RGWs to CPU Cores
● Purpose
Bind the OSD and RGW processes to fixed CPU cores to prevent certain CPU cores from being overloaded.
● Procedure
When NIC software interrupts and Ceph processes share CPUs under heavy network load, certain CPUs may be overloaded and become bottlenecks, compromising the Ceph cluster performance. To solve the problem, bind the NIC software interrupts and Ceph processes to different CPU cores. Table 3-12 describes the parameters to be modified.

Table 3-12 Binding OSDs and RGWs to CPU cores

osd.[N] — Binds the osd.N daemon process to a specified idle NUMA node, which is a node other than the nodes that process NIC software interrupts. Default value: none. Suggestion: bind the osd.N daemon process to CPU cores that do not process NIC software interrupts to prevent those cores from becoming bottlenecks.
rgw.[N] — Binds the rgw.N daemon process to a specified idle NUMA node, which is a node other than the nodes that process NIC software interrupts. Default value: none. Suggestion: bind the rgw.N daemon process to CPU cores that do not process NIC software interrupts to prevent those cores from becoming bottlenecks.


NOTE
The Ceph OSD/RGW daemon processes and the NIC software interrupt processing must run on different CPU cores. Otherwise, CPU bottlenecks may occur when the network load is heavy.
Run the following commands on all Ceph nodes to bind the CPU cores:
for i in `ps -ef | grep rgw | grep -v grep | awk '{print $2}'`; do taskset -pc 4-47 $i; done
for i in `ps -ef | grep osd | grep -v grep | awk '{print $2}'`; do taskset -pc 4-47 $i; done
NOTE
Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU cores of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the CPU cores bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC. For example, run the cat /sys/class/net/<Port Name>/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 to prevent the OSD and the NIC software interrupts from using the same CPU cores. The core binding of the RGW is similar to that of the OSD. After finding an idle NUMA node, run the lscpu command to query the IDs of the CPU cores that correspond to that NUMA node. In the preceding command lines, 4-47 indicates the idle CPU cores of the node; change the value as required.

3.3 General-Purpose Storage

3.3.1 Hardware Tuning

NVMe SSD Tuning
● Purpose
Reduce cross-chip data overheads.
● Procedure
Install the NVMe SSDs and the NIC on the same riser card.

DIMM Installation Mode Tuning
● Purpose
Populate one DIMM per channel (1DPC) to maximize the memory performance. That is, populate the DIMM 0 slot of each channel.
● Procedure
Preferentially populate the DIMM 0 slots (DIMM 000, 010, 020, 030, 040, 050, 100, 110, 120, 130, 140, and 150). Of the three digits in the DIMM slot number, the first digit indicates the CPU, the second digit indicates the DIMM channel, and the third digit indicates the DIMM. Populate the DIMM slots whose third digit is 0 in ascending order.

3.3.2 System Tuning

Optimizing the Network Performance
● Purpose
This test uses the 25GE Ethernet adapter (Hi1822) with four SFP+ ports. It is used as an example to describe how to optimize the NIC parameters for optimal performance.
● Procedure
The optimization methods include adjusting NIC parameters and interrupt-core binding (binding interrupts to the physical CPU of the NIC). Table 3-13 describes the optimization items.

Table 3-13 NIC parameters

irqbalance — System interrupt balancing service, which automatically allocates NIC software interrupts to idle CPUs. Default value: active. Symptom: when this function is enabled, the system automatically allocates NIC software interrupts to idle CPUs. Suggestion: disable irqbalance (systemctl stop irqbalance) and keep it disabled after the server is restarted (systemctl disable irqbalance).
rx_buff — Aggregation of large network packets requires multiple discontinuous memory pages and causes low memory usage. You can increase the value of this parameter to improve the memory usage. Default value: 2. Symptom: with the default value 2, interrupts consume a large number of CPU resources. Suggestion: load the rx_buff parameter with the value 8 to reduce discontinuous memory and improve memory usage and performance. For details, see the description following the table.
ring_buffer — You can increase the throughput by adjusting the NIC buffer size. Default value: 1024. Symptom: run the ethtool -g <NIC name> command to view the value. Suggestion: change the ring_buffer queue size to 4096. For details, see the description following the table.
lro — Large receive offload. After this function is enabled, multiple small packets are aggregated into one large packet for better efficiency. Default value: off. Symptom: after this function is enabled, the maximum throughput increases significantly. Suggestion: enable the large-receive-offload function to improve the efficiency of sending and receiving packets. For details, see the description following the table.
hinicadm lro -i hinic0 -t <NUM> — Received aggregated packets are sent after the time specified by <NUM> (in microseconds). Default value: 16 microseconds. Symptom: this parameter is used with the LRO function. Suggestion: change the value to 256 microseconds for better efficiency.
hinicadm lro -i hinic0 -n <NUM> — Received aggregated packets are sent after the number of aggregated packets reaches the value specified by <NUM>. Default value: 4. Symptom: this parameter is used with the LRO function. Suggestion: change the value to 32 for better efficiency.

– Adjusting rx_buff
i. Go to the /etc/modprobe.d directory.
cd /etc/modprobe.d
ii. Create the hinic.conf file.
vi /etc/modprobe.d/hinic.conf
Add the following information to the file:
options hinic rx_buff=8
iii. Reload the driver.
rmmod hinic
modprobe hinic
iv. Check whether the value of rx_buff is changed to 8.
cat /sys/bus/pci/drivers/hinic/module/parameters/rx_buff
– Adjusting ring_buffer
i. Change the buffer size from the default value 1024 to 4096.
ethtool -G <NIC name> rx 4096 tx 4096
ii. Check the current buffer size.
ethtool -g <NIC name>
– Enabling LRO
i. Enable the LRO function for a NIC.
ethtool -K <NIC name> lro on
ii. Check whether the function is enabled.
ethtool -k <NIC name> | grep large-receive-offload

NOTE
In addition to optimizing the preceding parameters, you need to bind the NIC software interrupts to cores:
1. Disable the irqbalance service.
2. Query the NUMA node to which the NIC belongs:
cat /sys/class/net/<Network port name>/device/numa_node
3. Query the CPU cores that correspond to the NUMA node:
lscpu
4. Query the interrupt IDs corresponding to the NIC:
cat /proc/interrupts | grep <Network port name> | awk -F ':' '{print $1}'
5. Bind the software interrupts to the cores corresponding to the NUMA node:
echo <core number> > /proc/irq/<Interrupt ID>/smp_affinity_list

Enabling SMMU Passthrough
● Purpose
To maximize the performance of the Kunpeng processor, you are advised to enable SMMU passthrough.
● Procedure
Step 1 Edit the /etc/grub2-efi.cfg file.
vi /etc/grub2-efi.cfg
Step 2 Find the line where vmlinuz-4.14.0-115.el7a.0.1.aarch64 is located in the kernel code, add iommu.passthrough=1 to the end of the line, save the file and exit, and restart the server.
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root e343026a-d245-4812-95ce-6ff999b3571c
else
search --no-floppy --fs-uuid --set=root e343026a-d245-4812-95ce-6ff999b3571c
fi
linux /vmlinuz-4.14.0-115.el7a.0.1.aarch64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap LANG=en_US.UTF-8 iommu.passthrough=1
initrd /initramfs-4.14.0-115.el7a.0.1.aarch64.img

----End


NOTE

This tuning procedure applies only to the Kunpeng computing platform.

4.14.0-115.el7a.0.1.aarch64 is the kernel version of CentOS 7.6. If you use another OS, run the uname -r command to query the current kernel version, and add iommu.passthrough=1 at the end of the line where vmlinuz-<kernel version> is located.
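After the server restarts, you can confirm that the option actually reached the kernel command line before rerunning any benchmarks; a minimal check:

cat /proc/cmdline | grep -o iommu.passthrough=1    # should print iommu.passthrough=1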

Optimizing the OS Configuration

● Purpose

Adjust the system configuration to maximize the hardware performance.

● Procedure

Table 3-14 lists the optimization items.

Table 3-14 OS configuration parameters

Parameter Description Suggestion Configuration Method

vm.swappiness
The swap partition is the virtual memory of the system. Do not use the swap partition because it will deteriorate system performance.
Default value: 60. Symptom: The performance deteriorates significantly when the swap partition is used. Suggestion: Disable the swap partition and set this parameter to 0.
Configuration method: Run the following command:
sudo sysctl vm.swappiness=0

MTU
Maximum size of a data packet that can pass through a NIC. After the value is increased, the number of network packets can be reduced and the efficiency can be improved.
Default value: 1500 bytes. Symptom: Run the ip addr command to view the value. Suggestion: Set the maximum size of a data packet that can pass through a NIC to 9000 bytes.
Configuration method:
1. Run the following command and add MTU="9000" to the file:
vi /etc/sysconfig/network-scripts/ifcfg-${Interface}
NOTE: ${Interface} indicates the network port name.
2. After the configuration is complete, restart the network service.
service network restart


pid_max
The default value of pid_max is 32768, which is sufficient in normal cases. However, when heavy workloads are being processed, this value is insufficient and may cause memory allocation failure.
Default value: 32768. Symptom: Run the cat /proc/sys/kernel/pid_max command to view the value. Suggestion: Set the maximum number of threads that can be generated in the system to 4194303.
Configuration method: Run the following command:
echo 4194303 > /proc/sys/kernel/pid_max

file_max
Maximum number of files that can be opened by all processes in the system. In addition, some programs can call the setrlimit interface to set the limit on each process. If the system generates a large number of errors indicating that file handles are used up, increase the value of this parameter.
Default value: 13291808. Symptom: Run the cat /proc/sys/fs/file-max command to view the value. Suggestion: Set the maximum number of files that can be opened by all processes in the system to the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.
Configuration method: Run the following command:
echo ${file-max} > /proc/sys/fs/file-max
NOTE: ${file-max} is the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.


read_ahead
Linux readahead means that the Linux kernel prefetches a certain area of the specified file and loads it into the page cache, so that subsequent access to that area does not block on page faults. Reading data from memory is much faster than reading data from drives. Therefore, the readahead feature can effectively reduce the number of drive seeks and the I/O waiting time of the applications. It is one of the important methods for optimizing the drive read I/O performance.
Default value: 128 KB. Symptom: Run /sbin/blockdev --getra /dev/sdb to view the value. Suggestion: Change the value to 8192 to improve the drive read efficiency by prefetching data into random access memory (RAM).
Configuration method: Run the following command:
/sbin/blockdev --setra 8192 /dev/sdb
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.


I/O_Scheduler
The Linux I/O scheduler is a component of the Linux kernel. You can adjust the scheduler to optimize system performance.
Default value: CFQ. Symptom: The Linux I/O scheduler needs to be configured based on different storage devices for the optimal system performance. Suggestion: Set the I/O scheduling policy to deadline for HDDs and noop for SSDs.
Configuration method: Run the following command:
echo deadline > /sys/block/sdb/queue/scheduler
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.

nr_requests
If the Linux system receives a large number of read requests, the default number of request queues may be insufficient. To deal with this problem, you can dynamically adjust the default number of request queues in the /sys/block/hda/queue/nr_requests file.
Default value: 128. Symptom: Increase the drive throughput by adjusting the nr_requests parameter. Suggestion: Set the number of drive request queues to 512.
Configuration method: Run the following command:
echo 512 > /sys/block/sdb/queue/nr_requests
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.
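For convenience, the per-drive settings in Table 3-14 can be applied with a short script. This is only a sketch, assuming the data drives are sdb, sdc, and sdd and that the readahead value 8192 matches the table's recommendation; none of these echo/sysctl settings persist across reboots, so they must be reapplied or written to the corresponding configuration files:

sudo sysctl vm.swappiness=0
echo 4194303 > /proc/sys/kernel/pid_max
echo $(cat /proc/meminfo | grep MemTotal | awk '{print $2}') > /proc/sys/fs/file-max
for disk in sdb sdc sdd; do                                # list all data drives here
    /sbin/blockdev --setra 8192 /dev/${disk}               # readahead
    echo deadline > /sys/block/${disk}/queue/scheduler     # use noop instead for SSDs
    echo 512 > /sys/block/${disk}/queue/nr_requests        # request queue depth
done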

3.3.3 Ceph Tuning

Modifying Ceph Configuration

● Purpose

Adjust the Ceph configuration to maximize system resource usage.

● Procedure

You can edit the /etc/ceph/ceph.conf file to modify all Ceph configuration parameters. For example, you can add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file to change the default number of copies to 4 and run the systemctl restart ceph.target command to restart the Ceph daemon process for the change to take effect.
The preceding operations take effect only on the current Ceph node. You need to modify the ceph.conf file on all Ceph nodes and restart the Ceph daemon process for the modification to take effect on the entire Ceph cluster. Table 3-15 describes the Ceph optimization items.
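A minimal sketch of pushing the edited file to every node and restarting the daemons, assuming three hosts named node1, node2, and node3 (placeholders for the host names in your node plan) and passwordless SSH between them:

for node in node1 node2 node3; do
    scp /etc/ceph/ceph.conf ${node}:/etc/ceph/ceph.conf
    ssh ${node} "systemctl restart ceph.target"
done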

Table 3-15 Ceph parameter configuration

Parameter Description Suggestion

[global]

cluster_network
You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network.
Recommended value: 192.168.4.0/24. You can set this parameter as required as long as it is different from the public network segment.

public_network
Recommended value: 192.168.3.0/24. You can set this parameter as required as long as it is different from the cluster network segment.

osd_pool_default_size
Number of copies.
Recommended value: 3

osd_memory_target
Size of memory that each OSD process is allowed to obtain.
Recommended value: 4294967296

For details about how to optimize other parameters, see Table 3-16.

Table 3-16 Other parameter configuration

Parameter Description Suggestion

[global]

osd_pool_default_min_size

Minimum number of I/O copies that the PG can receive. If a PG is in the degraded state, its I/O capability is not affected.

Default value: 0. Recommended value: 1


cluster_network
You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network.
This parameter has no default value. Recommended value: 192.168.4.0/24

osd_memory_target
Size of memory that each OSD process is allowed to obtain.
Default value: 4294967296. Recommended value: 4294967296

[mon]

mon_clock_drift_allowed
Clock drift allowed between MONs.
Default value: 0.05. Recommended value: 1

mon_osd_min_down_reporters
Minimum down OSD quantity that triggers a report to the MONs.
Default value: 2. Recommended value: 13

mon_osd_down_out_interval
Number of seconds that Ceph waits before an OSD is marked as down or out.
Default value: 600. Recommended value: 600

[OSD]

osd_journal_size
OSD journal size.
Default value: 5120. Recommended value: 20000

osd_max_write_size
Maximum size (in MB) of data that can be written by an OSD at a time.
Default value: 90. Recommended value: 512

osd_client_message_size_cap
Maximum size (in bytes) of data that can be stored in the memory by the clients.
Default value: 100. Recommended value: 2147483648

osd_deep_scrub_stride
Number of bytes that can be read during deep scrubbing.
Default value: 524288. Recommended value: 131072

osd_map_cache_size
Size of the cache (in MB) that stores the OSD map.
Default value: 50. Recommended value: 1024


osd_recovery_op_priority
Recovery priority. The value ranges from 1 to 63. A larger value indicates higher resource usage.
Default value: 3. Recommended value: 2

osd_recovery_max_active
Number of active recovery requests in the same period.
Default value: 3. Recommended value: 10

osd_max_backfills
Maximum number of backfills allowed by an OSD.
Default value: 1. Recommended value: 4

osd_min_pg_log_entries
Minimum number of reserved PG logs.
Default value: 3000. Recommended value: 30000

osd_max_pg_log_entries
Maximum number of reserved PG logs.
Default value: 3000. Recommended value: 100000

osd_mon_heartbeat_interval
Interval (in seconds) for an OSD to ping a MON.
Default value: 30. Recommended value: 40

ms_dispatch_throttle_bytes
Maximum number of messages to be dispatched.
Default value: 104857600. Recommended value: 1048576000

objecter_inflight_ops
Allowed maximum number of unsent I/O requests. This parameter is used for client traffic control. If the number of unsent I/O requests exceeds the threshold, the application I/O is blocked. The value 0 indicates that the number of unsent I/O requests is not limited.
Default value: 1024. Recommended value: 819200

osd_op_log_threshold
Number of operation logs to be displayed at a time.
Default value: 5. Recommended value: 50

osd_crush_chooseleaf_type
Bucket type when the CRUSH rule uses chooseleaf.
Default value: 1. Recommended value: 0


journal_max_write_bytes
Maximum number of journal bytes that can be written at a time.
Default value: 10485760. Recommended value: 1073714824

journal_max_write_entries
Maximum number of journal records that can be written at a time.
Default value: 100. Recommended value: 10000

[Client]

rbd_cache
RBD cache.
Default value: True. Recommended value: True

rbd_cache_size
RBD cache size (in bytes).
Default value: 33554432. Recommended value: 335544320

rbd_cache_max_dirty
Maximum number of dirty bytes allowed when the cache is set to the writeback mode. If the value is 0, the cache is set to the writethrough mode.
Default value: 25165824. Recommended value: 134217728

rbd_cache_max_dirty_age
Duration (in seconds) for which the dirty data is stored in the cache before being flushed to the drives.
Default value: 1. Recommended value: 30

rbd_cache_writethrough_until_flush
This parameter is used for compatibility with the virtio driver earlier than linux-2.6.32. It prevents data from being written back when no flush request has been sent. After this parameter is set, librbd processes I/Os in writethrough mode. The mode is switched to writeback only after the first flush request is received.
Default value: True. Recommended value: False


rbd_cache_max_dirty_object
Maximum number of objects. The default value is 0, which indicates that the number is calculated based on the RBD cache size. By default, librbd logically splits the drive image into 4 MB chunks, and each chunk is abstracted as an object managed by the librbd cache. You can increase the value of this parameter to improve the performance.
Default value: 0. Recommended value: 2

rbd_cache_target_dirty
Dirty data size that triggers writeback. The value cannot exceed the value of rbd_cache_max_dirty.
Default value: 16777216. Recommended value: 235544320
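Taken together, the recommended values from Table 3-15 and Table 3-16 correspond to a ceph.conf along the lines of the following excerpt. This is only a sketch: the network segments are the example values used in this document, and every value should be checked against your own hardware before use.

[global]
public_network = 192.168.3.0/24
cluster_network = 192.168.4.0/24
osd_pool_default_size = 3
osd_pool_default_min_size = 1
osd_memory_target = 4294967296

[mon]
mon_clock_drift_allowed = 1
mon_osd_min_down_reporters = 13
mon_osd_down_out_interval = 600

[osd]
osd_journal_size = 20000
osd_max_write_size = 512
osd_client_message_size_cap = 2147483648
osd_deep_scrub_stride = 131072
osd_map_cache_size = 1024
osd_recovery_op_priority = 2
osd_recovery_max_active = 10
osd_max_backfills = 4
osd_min_pg_log_entries = 30000
osd_max_pg_log_entries = 100000
osd_mon_heartbeat_interval = 40
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_ops = 819200
osd_op_log_threshold = 50
osd_crush_chooseleaf_type = 0
journal_max_write_bytes = 1073714824
journal_max_write_entries = 10000

[client]
rbd_cache = true
rbd_cache_size = 335544320
rbd_cache_max_dirty = 134217728
rbd_cache_max_dirty_age = 30
rbd_cache_writethrough_until_flush = false
rbd_cache_max_dirty_object = 2
rbd_cache_target_dirty = 235544320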

Optimizing the PG Distribution

● Purpose

Adjust the number of PGs on each OSD to balance the load on each OSD.

● Procedure

By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs created in a storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool.
The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. You can run the ceph balancer on or ceph balancer off command to enable or disable the Ceph balancer function.
Table 3-17 describes the PG distribution parameters.


Table 3-17 PG distribution parameters

Parameter Description Suggestion

pg_num
Total PGs = (Total_number_of_OSD x 100)/max_replication_count. Round up the result to the nearest integer power of 2.
Default value: 8. Symptom: A warning is displayed if the number of PGs is insufficient. Suggestion: Calculate the value based on the formula.

pgp_num
Set the number of PGPs to be the same as that of PGs.
Default value: 8. Symptom: It is recommended that the number of PGPs be the same as the number of PGs. Suggestion: Calculate the value based on the formula.

ceph_balancer_mode
Enable the balancer plug-in and set the plug-in mode to upmap.
Default value: none. Symptom: If the number of PGs is unbalanced, some OSDs may be overloaded and become bottlenecks. Recommended value: upmap

NOTE

● The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.

● Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.

● The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs corresponding to each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.
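As a worked example of the pg_num formula (a sketch only; the pool name testpool is a placeholder): with the 3-node, 12-OSD-per-node layout used in this document there are 36 OSDs, so Total PGs = (36 x 100)/3 = 1200, which rounds up to the next power of 2, 2048. The corresponding commands would be:

ceph osd pool create testpool 2048 2048      # create the pool with 2048 PGs/PGPs
ceph osd pool get testpool pg_num            # verify
ceph balancer mode upmap
ceph balancer on
ceph balancer eval                           # score the distribution; lower is better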

Binding OSDs and RGWs to CPU Cores

● Purpose

Bind the OSD and RGW processes to fixed CPU cores to prevent certain CPU cores from being overloaded.


● Procedure

When NIC software interrupts and Ceph processes share CPUs under heavy network load, certain CPUs may be overloaded and become bottlenecks, compromising the Ceph cluster performance. To solve the problem, bind the NIC software interrupts and Ceph processes to different CPU cores. Table 3-18 describes the parameters to be modified.

Table 3-18 Binding OSDs and RGWs to CPU cores

Parameter Description Suggestion

osd.[N]
Binds the osd.N daemon process to a specified idle NUMA node, which is a node other than the nodes that process NIC software interrupts.
Default value: none. Suggestion: Bind the osd.N daemon process to specified CPU cores that do not process NIC software interrupts to prevent those cores from becoming bottlenecks.

rgw.[N]
Binds the RGW daemon process to a specified idle NUMA node, which is a node other than the nodes that process NIC software interrupts.
Default value: none. Suggestion: Bind the rgw.N daemon process to CPU cores that do not process NIC software interrupts to prevent those cores from becoming bottlenecks.

NOTE

The Ceph OSD/RGW daemon processes and the NIC software interrupt processing must run on different CPU cores. Otherwise, CPU bottlenecks may occur when the network load is heavy.

Run the following commands on all Ceph nodes to bind the CPU cores:
for i in `ps -ef | grep rgw | grep -v grep | awk '{print $2}'`; do taskset -pc 4-47 $i; done
for i in `ps -ef | grep osd | grep -v grep | awk '{print $2}'`; do taskset -pc 4-47 $i; done

NOTE

Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU cores of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the CPU cores bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC. For example, run the cat /sys/class/net/<Port Name>/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 to prevent the OSD and NIC software interrupts from using the same CPU cores. The core binding of the RGW is similar to that of the OSD. After finding an idle NUMA node, you can run the lscpu command to query the IDs of the CPU cores that correspond to the NUMA node. In the preceding command lines, 4-47 indicates the idle CPU cores of the node. Change the value as required.
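Following the note above, the OSD side of the binding can also be expressed persistently in ceph.conf instead of with taskset. A minimal sketch, assuming the NIC was found on NUMA node 2 so node 0 is chosen for the OSDs:

[osd]
osd_numa_node = 0    # any node other than the one reported for the NIC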


Optimizing Compression Algorithm Configuration Parameters

● Purpose

Adjust the compression algorithm configuration parameters to optimize the performance of the compression algorithm.

● Procedure

The default value of bluestore_min_alloc_size_hdd for Ceph is 32 KB. The value of this parameter affects the size of the final data obtained after the compression algorithm is run. Set this parameter to a smaller value to maximize the compression ratio of the compression algorithm.

By default, Ceph uses five threads to process I/O requests in an OSD process. After the compression algorithm is enabled, the number of threads can become a performance bottleneck. Increase the number of threads to maximize the performance of the compression algorithm.

The following table describes the compression-related parameter configuration:

Parameter Description Suggestion

bluestore_min_alloc_size_hdd
Minimum size of objects allocated to the HDD data disks in the BlueStore storage engine.
Default value: 32768. Recommended value: 8192

osd_op_num_shards_hdd
Number of shards for an HDD data disk in an OSD process.
Default value: 5. Recommended value: 12

osd_op_num_threads_per_shard_hdd
Average number of threads of an OSD process for each HDD data disk shard.
Default value: 1. Recommended value: 2
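In ceph.conf terms these recommendations would look like the sketch below. Note, as an assumption worth verifying on your Ceph version, that bluestore_min_alloc_size_hdd is applied when an OSD is created, so it only affects OSDs deployed after the change:

[osd]
bluestore_min_alloc_size_hdd = 8192
osd_op_num_shards_hdd = 12
osd_op_num_threads_per_shard_hdd = 2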

Using the I/O Passthrough Tool for Optimization

The I/O passthrough tool is a process optimization tool for balanced scenarios of the Ceph cluster. It can automatically detect and optimize OSDs in the Ceph cluster. For details on how to use this tool, see the I/O Passthrough Tool User Guide.

3.3.4 KAE zlib Compression Tuning

● Purpose

Optimize zlib compression to maximize the CPU capability available for OSD processes and maximize the hardware performance.

● Procedure

Enable the hardware acceleration engine to implement zlib compression.


Preparing the Environment

NOTE

Before installing the accelerator engine, you need to apply for and install a license.

License application guide:

https://support.huawei.com/enterprise/zh/doc/EDOC1100068122/b9878159

Installation guide:

https://support.huawei.com/enterprise/en/doc/EDOC1100048786/ba20dd15

Download the acceleration engine installation package and developer guide.

Download link: https://github.com/kunpengcompute/KAE/tags

Installing the Acceleration Engine

NOTE

The developer guide describes how to install and use all modules of the accelerator engine. Select an appropriate installation mode based on the developer guide.

For details, see Installing the KAE Software Package Using Source Code.

Step 1 Install the acceleration engine according to the developer guide.

Step 2 Install the zlib library.

1. Download KAEzip.
2. Download zlib-1.2.11.tar.gz from the zlib official website and copy it to KAEzip/open_source.
3. Perform the compilation and installation.
cd KAEzip
sh setup.sh install

The zlib library is installed in /usr/local/kaezip.

Step 3 Back up the existing library link.
mv /lib64/libz.so.1 /lib64/libz.so.1-bak

Step 4 Replace the zlib software compression algorithm dynamic library.
cd /usr/local/kaezip/lib
cp libz.so.1.2.11 /lib64/
mv /lib64/libz.so.1 /lib64/libz.so.1-bak
ln -s /lib64/libz.so.1.2.11 /lib64/libz.so.1

NOTE

In the cd /usr/local/kaezip/lib command, /usr/local/kaezip indicates the KAEzip installation path. Change it as required.

----End

NOTE

If the Ceph cluster is running before the dynamic library is replaced, run the following command on all storage nodes to restart the OSDs for the change to take effect after the dynamic library is replaced:
systemctl restart ceph-osd.target


Changing the Default Number of Accelerator Queues

NOTE

The default number of queues of the hardware accelerator is 256. To fully utilize the performance of the accelerator, change the number of queues to 512 or 1024.

Step 1 Remove hisi_zip.
rmmod hisi_zip

Step 2 Set the default accelerator queue parameter pf_q_num=512.
vi /etc/modprobe.d/hisi_zip.conf
options hisi_zip uacce_mode=2 pf_q_num=512

Step 3 Load hisi_zip.
modprobe hisi_zip

Step 4 Check the hardware accelerator queue.
cat /sys/class/uacce/hisi_zip-*/attrs/available_instances

The change is successful if the command output shows the new queue count.

Step 5 Check the dynamic library links. If libwd.so.1 is contained in the command output, the operation is successful.
ldd /lib64/libz.so.1

----End

Adapting Ceph to the Accelerator

NOTE

Currently, the mainline Ceph versions allow configuring the zlib compression mode using the configuration file. The released Ceph versions (up to version 15.2.3) adopt the zlib compression mode without data headers or trailers. However, the current hardware acceleration library supports only the mode with data headers and trailers. Therefore, the Ceph source code needs to be modified to adapt to the Kunpeng hardware acceleration library. For details about the modification method, see the latest patch that has been incorporated into the mainline versions.

https://github.com/ceph/ceph/pull/34852

The following uses Ceph 14.2.8 as an example to describe how to adapt Ceph to the zlib compression engine.

Step 1 Obtain the source code.

URL: https://download.ceph.com/tarballs/


After the source code package is downloaded, save it to the /home directory on the server.

Step 2 Obtain the patch and save it to the /home directory.

https://mirrors.huaweicloud.com/kunpeng/archive/kunpeng_solution/storage/Patch/

Step 3 Go to the /home directory, decompress the source code package, and go to the directory generated after decompression.
cd /home && tar -zxvf ceph-14.2.8.tar.gz && cd ceph-14.2.8/

Step 4 Apply the patch in the source code directory.
patch -p1 < ../ceph-14.2.8-zlib-compress.patch

Step 5 After modifying the source code, compile Ceph.
● CentOS: See Ceph 14.2.1 Porting Guide (CentOS 7.6).
● openEuler: See Ceph 14.2.8 Porting Guide (openEuler 20.03).

Step 6 Install Ceph.

Step 7 Modify the ceph.conf file to configure the zlib compression mode.
vi /etc/ceph/ceph.conf
compressor_zlib_winsize=15

Step 8 Modify the systemd permissions. In Ceph, the RGW process service is managed by systemd. To enable systemd to access hardware acceleration devices, you need to modify the configuration on each RGW node as follows:

1. Open the configuration file.
vi /usr/lib/systemd/system/ceph-radosgw@.service

Change PrivateDevices=yes to PrivateDevices=no.
2. Make the modification take effect.

systemctl daemon-reload

Step 9 Restart the Ceph cluster for the configuration to take effect, and then verify the setting.
ceph daemon osd.0 config show | grep compressor_zlib_winsize

----End
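The steps above route zlib through the KAE engine, but BlueStore compression still has to be switched on for the pools that should use it. A minimal sketch with standard Ceph pool options (testpool is a placeholder pool name; aggressive mode compresses all writes):

ceph osd pool set testpool compression_algorithm zlib
ceph osd pool set testpool compression_mode aggressive
ceph daemon osd.0 config show | grep compressor_zlib_winsize   # confirm the window size is applied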


3.4 High-Performance Storage

3.4.1 Hardware Tuning

High-performance configuration tuning

● Purpose

Balance the loads of the two CPUs.

● Procedure

Evenly distribute the NVMe SSDs and NICs to the two CPUs.

Hardware Type Tuning Method Remarks

NIC
NUMA resource balancing
For example, you can insert the LOMs into the PCIe slots of CPU 1 and the Mellanox ConnectX-4 NICs into the PCIe slots of CPU 2 to balance the loads of the two CPUs.

Storage
NUMA resource balancing
For example, you can insert six NVMe SSDs into the PCIe slots of CPU 1 and the other six NVMe SSDs into the PCIe slots of CPU 2 to balance the loads of the two CPUs.

3.4.2 Ceph Tuning

● Purpose

Adjust the Ceph configuration items to fully utilize the hardware performance of the system.

● Procedure

You can edit the /etc/ceph/ceph.conf file to modify all Ceph configuration parameters.
For example, to change the number of copies to 4, you can add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file and run the systemctl restart ceph.target command to restart the Ceph daemon process for the change to take effect.
The setting takes effect only for the current Ceph node. To enable the settings to take effect for the entire Ceph cluster, you need to modify the ceph.conf file of each Ceph node and restart the Ceph daemon process.
Table 3-19 lists the optimization items.


Table 3-19 Ceph parameters

Parameter Description Suggestion

[global]

osd_pool_default_min_size
Specifies the minimum number of I/O copies that the PG can receive. If a PG is in the degraded state, its I/O capability is not affected.
Default value: 0. Recommended value: 1

cluster_network
Configures a network segment different from the public network. This network segment is used for replication and data balancing between OSDs to relieve the pressure on the public network.
Default value: none. Recommended value: 192.168.4.0/24

osd_pool_default_size
Specifies the number of copies.
Default value: 3. Recommended value: 3

mon_max_pg_per_osd
Indicates the PG alarm threshold. You can increase the value for better performance.
Default value: 250. Recommended value: 3000

[rgw]

rgw_override_bucket_index_max_shards
Specifies the number of shards per bucket index. The value 0 indicates that no shard is available.
Default value: 0. Recommended value: 8
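As a sketch, Table 3-19 translates into a ceph.conf fragment like the one below; the section layout follows the table, and the network segment is the example value used in this document:

[global]
cluster_network = 192.168.4.0/24
osd_pool_default_size = 3
osd_pool_default_min_size = 1
mon_max_pg_per_osd = 3000

[rgw]
rgw_override_bucket_index_max_shards = 8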

PG Distribution Tuning

● Purpose

Adjust the number of PGs on each OSD to balance the load on each OSD.

● Procedure

By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs created in a storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool.


The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. You can run the ceph balancer on or ceph balancer off command to enable or disable the Ceph balancer function.
Table 3-20 describes the PG distribution parameters.

Table 3-20 PG distribution parameters

Parameter Description Suggestion

pg_num
Total PGs = (Total_number_of_OSD x 100)/max_replication_count. Round up the result to the nearest integer power of 2.
Default value: 8. Symptom: A warning is displayed if PGs are insufficient. Suggestion: Calculate the value based on the formula.

pgp_num
Sets the number of PGPs to be the same as that of PGs.
Default value: 8. Symptom: It is recommended that the number of PGPs be the same as that of PGs. Suggestion: Calculate the value based on the formula.

ceph_balancer_mode
Enables the balancer plug-in and sets the plug-in mode to upmap.
Default value: none. Symptom: If PGs are not evenly distributed across OSDs, some OSDs may be overloaded and become bottlenecks. Suggestion: Set this parameter to upmap.

NOTE

● The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.

● Run the ceph balancer mode upmap and ceph balancer on commands to automatically optimize the Ceph PG distribution. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.

● The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs carried by each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.


3.4.3 KAE MD5 Digest Algorithm Tuning

● Purpose

Optimize the MD5 calculation performed when the RGW writes objects, to maximize the CPU capability available for the RGW process and maximize the hardware performance.

● Procedure

Enable the Kunpeng Accelerator Engine (KAE) to implement MD5 calculation.

Environment Preparations

NOTE

Before installing the KAE, you need to apply for and install a license.
License application guide: https://support.huawei.com/enterprise/zh/doc/EDOC1100068122/b9878159
Installation guide: https://support.huawei.com/enterprise/zh/doc/EDOC1100048792/ba20dd15

Download the acceleration engine installation package and developer guide.

URL: https://github.com/kunpengcompute/KAE/releases/tag/v1.3.6-bata

Installing the Accelerator Engine

NOTE

● The developer guide describes how to install and use all modules of the accelerator engine. After reading the guide, select an appropriate installation mode.

● If CentOS 7.6 and OpenSSL 1.0.2k are used, you need to download the corresponding libkae engine software package. The installation method of the package is the same as that in the developer guide.

Install the accelerator engine as instructed by the developer guide.

Changing the Default Number of Accelerator Queues

NOTE

The default number of hardware accelerator queues is 256. When the number of remaining accelerator queues (which can be obtained in Step 4) is 0 or a small value during service running, you can change the number of queues to 512 or 1024 to maximize the accelerator performance.

Step 1 Uninstall hisi_sec2.
rmmod hisi_sec2

Step 2 Set the default accelerator queue parameter pf_q_num=512.
vi /etc/modprobe.d/hisi_sec2.conf
options hisi_sec2 uacce_mode=2 enable_sm4_ctr=1 pf_q_num=512

Step 3 Load hisi_sec2.
modprobe hisi_sec2

Step 4 Check the hardware accelerator queue.
cat /sys/class/uacce/hisi_sec2-*/attrs/available_instances


The change is successful if the command output shows the new queue count.

----End

Adapting Ceph to the Accelerator

NOTE

Currently, the Ceph mainline version supports the configuration of the OpenSSL external engine using the configuration file. However, this feature is not available in the released Ceph versions (up to v15.2.3). To use MD5 hardware acceleration in these versions, you need to modify the Ceph source code. For details about the modification method, see the latest patch that has been incorporated into the mainline versions.
https://github.com/ceph/ceph/pull/33964/
The following uses Ceph v14.2.8 as an example to describe how to adapt Ceph to MD5 hardware acceleration.

Step 1 Obtain the source code.

URL: https://download.ceph.com/tarballs/

After the source code package is downloaded, save it to the /home directory on the server.

Step 2 Obtain the adaptation patch. Download the patch that enables the OpenSSL external engine for Ceph v14.2.8. After downloading the patch, save it to the /home directory.

Download the patch at https://mirrors.huaweicloud.com/kunpeng/archive/kunpeng_solution/storage/Patch/.

Step 3 Go to the /home directory, decompress the source code package, and go to the directory generated after decompression.
cd /home && tar -zxvf ceph-14.2.8.tar.gz && cd ceph-14.2.8/

Step 4 Apply the patch in the root directory of the source code.
patch -p1 < ../ceph-14.2.8-common-rgw-add-openssl-engine-support.patch


Step 5 After modifying the source code, compile Ceph.
● CentOS: See Ceph 14.2.1 Porting Guide (CentOS 7.6).
● openEuler: See Ceph 14.2.8 Porting Guide (openEuler 20.03).

Step 6 Install Ceph.

Step 7 Modify the systemd permissions. In Ceph, the RGW process service is managed by systemd. To enable systemd to access hardware acceleration devices, you need to modify the configuration on each RGW node as follows:

1. Open the configuration file.
vi /usr/lib/systemd/system/ceph-radosgw@.service

Change PrivateDevices=yes to PrivateDevices=no.
2. Make the modification take effect.

systemctl daemon-reload

Step 8 Use KAE to accelerate RGW digest computing. Add the following OpenSSL engine options to the global section in the ceph.conf file on the node where the RGW is deployed:
openssl_engine_opts = "engine_id=kae,dynamic_path=/usr/local/lib/engines-1.1/libkae.so,KAE_CMD_ENABLE_ASYNC=0,default_algorithms=DIGESTS,init=1"

dynamic_path is the default installation path of libkae.so in the libkae engine software package. default_algorithms=DIGESTS indicates that this engine is used for digest algorithms including MD5. After completing the configuration, synchronize the configuration to all nodes where the RGW is deployed.

Step 9 Restart the RGW for the settings to take effect. Run the following command on each node where the RGW is deployed:
systemctl restart ceph-radosgw.target

----End
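Before rerunning the workload, it can be worth confirming that the OpenSSL engine named in openssl_engine_opts is actually loadable on each RGW node. A minimal sketch, assuming libkae.so sits at the path configured above (the engine id kae comes from that same setting):

openssl engine -t kae                  # should report the engine as available
openssl speed -engine kae -evp md5     # optional: exercise MD5 through the engine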


4 Ceph File Storage Tuning Guide

4.1 Introduction

4.2 General-Purpose Storage

4.1 Introduction

4.1.1 Components

Ceph

Ceph is a distributed, scalable, reliable, and high-performance storage system platform that supports storage interfaces including block devices, file systems, and object gateways. The optimization methods described in this document include hardware optimization and software configuration optimization. Software code optimization is not involved. By adjusting the system and Ceph configuration parameters, Ceph can fully utilize the hardware performance of the system. Ceph Placement Group (PG) distribution optimization and object storage daemon (OSD) core binding aim to balance drive loads and prevent any OSD from becoming a bottleneck. In addition, in general-purpose storage scenarios, using NVMe SSDs as Bcache can also improve performance. Figure 4-1 shows the Ceph architecture.


Figure 4-1 Ceph architecture

Table 4-1 describes the Ceph modules and components.

Table 4-1 Module functions

Module Function

RADOS
Reliable Autonomic Distributed Object Store (RADOS) is the heart of a Ceph storage cluster. Everything in Ceph is stored by RADOS in the form of objects irrespective of their data types. The RADOS layer ensures data consistency and reliability through data replication, fault detection and recovery, and data recovery across cluster nodes.

OSD
Object storage daemons (OSDs) store the actual user data. Every OSD is usually bound to one physical drive. The OSDs handle the read/write requests from clients.

MON
The monitor (MON) is the most important component in a Ceph cluster. It manages the Ceph cluster and maintains the status of the entire cluster. The MON ensures that related components of a cluster can be synchronized at the same time. It functions as the leader of the cluster and is responsible for collecting, updating, and publishing cluster information. To avoid single points of failure (SPOFs), multiple MONs are deployed in a Ceph environment, and they must handle the collaboration between them.


MGR
The manager (MGR) is a monitoring system that provides collection, storage, analysis (including alarming), and visualization functions. It makes certain cluster parameters available for external systems.

Librados
Librados is a method that simplifies access to RADOS. It currently supports the programming languages PHP, Ruby, Java, Python, C, and C++. It provides a native interface to RADOS, the core of the Ceph storage cluster, and is the base component of other services such as the RADOS block device (RBD) and RADOS gateway (RGW). In addition, it provides the Portable Operating System Interface (POSIX) for the Ceph file system (CephFS). The Librados API can be used to directly access RADOS, enabling developers to create their own interfaces for accessing the Ceph cluster storage.

RBD
The RADOS block device (RBD) is the Ceph block device that provides block storage for external systems. It can be mapped, formatted, and mounted like a drive to a server.

RGW
The RADOS gateway (RGW) is a Ceph object gateway that provides RESTful APIs compatible with S3 and Swift. The RGW also supports multi-tenancy and the OpenStack Identity service (Keystone).

MDS
The Ceph Metadata Server (MDS) tracks the file hierarchy and stores metadata used only for CephFS. The RBD and RGW do not require metadata. The MDS does not directly provide data services for clients.

CephFS
The CephFS provides a POSIX-compatible distributed file system of any size. It depends on the Ceph MDS to track the file hierarchy, namely the metadata.

Vdbench

Vdbench is a command line utility designed to help engineers and customers generate drive I/O loads for verifying storage performance and data integrity. You can also specify Vdbench execution parameters by using text files.

Vdbench has many parameters. Table 4-2 lists some important common parameters.

Table 4-2 Common parameters

Parameter Description

-f Specifies a script file for the pressure test.

-o Specifies the path for exporting a report. The default value is the current path.


lun Specifies the LUN device or file to be tested.

size Specifies the size of the LUN device or file to be tested.

rdpct Specifies the read percentage. The value 100 indicates full read, and the value 0 indicates full write.

seekpct Specifies the percentage of random data. The value 100 indicates all random data, and the value 0 indicates sequential data.

elapsed Specifies the duration of the current test.
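Using the parameters in Table 4-2, a simple Vdbench profile and invocation could look like the sketch below. The device /dev/rbd0, the 4 KB random-write workload, and the 300-second run time are placeholders chosen for illustration only:

cat > ceph_4k_randwrite.txt <<'EOF'
sd=sd1,lun=/dev/rbd0,openflags=o_direct
wd=wd1,sd=sd1,xfersize=4k,rdpct=0,seekpct=100
rd=run1,wd=wd1,iorate=max,elapsed=300,interval=1
EOF
./vdbench -f ceph_4k_randwrite.txt -o ./result_4k_randwrite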

4.1.2 Environment

Physical Networking

The physical environment of the Ceph block devices contains two network layers and three nodes. In the physical environment, the MON, MGR, MDS, and OSD nodes are deployed together. At the network layer, the public network is separated from the cluster network. The two networks use 25GE optical ports for communication.

Figure 4-2 shows the physical network.

Figure 4-2 Physical networking


Hardware Configuration

Table 4-3 shows the Ceph hardware configuration.

Table 4-3 Hardware configuration

Server TaiShan 200 server (model 2280)

Processor Kunpeng 920 5230 processor

Core 2 x 32-core

CPU frequency 2600 MHz

Memory capacity 12 x 16 GB

Memory frequency 2666 MHz (8 Micron 2R memory modules)

NIC IN200 NIC (4 x 25GE)

Drive
System drives: RAID 1 (2 x 960 GB SATA SSDs)
Data drives of general-purpose storage: JBOD enabled in RAID mode (12 x 4 TB SATA HDDs)

NVMe SSD
Acceleration drive of general-purpose storage: 1 x 3.2 TB ES3600P V5 NVMe SSD
Data drives of high-performance storage: 12 x 3.2 TB ES3600P V5 NVMe SSDs

RAID controller card Avago SAS 3508

Software Versions

Table 4-4 lists the required software versions.

Table 4-4 Software versions

Software Version

OS CentOS Linux release 7.6.1810

openEuler 20.03 LTS SP1

Ceph 14.2.x Nautilus

ceph-deploy 2.0.1

Vdbench 5.04.06

Node Information

Table 4-5 describes the IP network segment planning of the hosts.


Table 4-5 Node information

Host Type Host Name Public Network Segment Cluster Network Segment

OSD/MON node Node 1 192.168.3.0/24 192.168.4.0/24

OSD/MGR node Node 2 192.168.3.0/24 192.168.4.0/24

OSD/MDS node Node 3 192.168.3.0/24 192.168.4.0/24

Component Deployment

Table 4-6 describes the deployment of service components in the Ceph block device cluster.

Table 4-6 Component deployment

Physical Machine Name OSD MON MGR MDS

Node 1 12 1 1 1

Node 2 12 1 1 1

Node 3 12 1 1 1

Cluster Check

Run the ceph health command to check the cluster health status. If HEALTH_OK is displayed, the cluster is running properly.

4.1.3 Tuning Guidelines and Process Flow

Tuning Guidelines

Performance optimization must comply with the following principles:

● When analyzing the performance, analyze the system resource bottlenecks from multiple aspects. For example, insufficient memory capacity may cause the CPU to be occupied by memory scheduling tasks and the CPU usage to reach 100%.

● Adjust only one performance parameter at a time.

● The analysis tool may consume system resources and aggravate certain system resource bottlenecks. Therefore, the impact on applications must be avoided or minimized.

Tuning Process Flow

The tuning analysis flow is as follows:


1. In many cases, pressure test traffic is not completely sent to the backend (server). For example, a protection policy may be triggered on network access layer services such as Server Load Balancing (SLB), Web Application Firewall (WAF), High Defense IP, and even Content Delivery Network (CDN)/site acceleration in a cloud-based architecture. This occurs because the specifications, such as bandwidth, maximum number of connections, and number of new connections, are limited, or the pressure test shows the features of Challenge Collapsar (CC) and Distributed Denial of Service (DDoS) attacks. As a result, the pressure test results do not meet expectations.

2. Check whether the key indicators meet the requirements. If not, locate the fault. The fault may be caused by the servers (in most cases) or the clients (in a few cases).

3. If the problem is caused by the servers, focus on the hardware indicators such as the CPU, memory, drive I/O, and network I/O. Locate the fault and perform further analysis on the abnormal hardware indicator.

4. If all hardware indicators are normal, check the middleware indicators such as the thread pool, connection pool, and GC indicators. Perform further analysis based on the abnormal middleware indicator.

5. If all middleware indicators are normal, check the database indicators such as the slow query SQL indicators, hit ratio, locks, and parameter settings.

6. If the preceding indicators are normal, the algorithm, buffer, cache, synchronization, or asynchronization of the applications may be faulty. Perform further analysis.

Table 4-7 lists the possible bottlenecks.

Table 4-7 Possible bottlenecks

Bottleneck Description

Hardware/Specifications
Problems of the CPU, memory, and drive I/O. The problems are classified into server hardware bottlenecks and network bottlenecks (network bottlenecks can be ignored in a LAN).

Middleware
Problems of software such as application servers, web servers, and database systems. For example, a bottleneck may be caused if parameters of the Java Database Connectivity (JDBC) connection pool configured on the WebLogic platform are set improperly.

Applications
Problems related to applications developed by developers. For example, when the system receives a large number of user requests, the following problems may cause low system performance, including slow SQL statements and improper Java Virtual Machine (JVM) parameters, container settings, database design, program architecture planning, and program design (insufficient threads for serial processing and request processing, no buffer, no cache, and mismatch between producers and consumers).


OS
Problems related to the OS, such as Windows, UNIX, or Linux. For example, if the physical memory capacity is insufficient and the virtual memory capacity is improper during a performance test, the virtual memory swap efficiency may be greatly reduced. As a result, the response time is increased. This bottleneck is caused by the OS.

Network devices
Problems related to devices such as the firewalls, dynamic load balancers, and switches. Currently, more network access products are used in the cloud service architecture, including but not limited to the SLB, WAF, High Defense IP, CDN, and site acceleration. For example, if a dynamic load distribution mechanism is set on the dynamic load balancer, the dynamic load balancer automatically sends subsequent transaction requests to low-load servers when the hardware resource usage of a server reaches the limit. If the dynamic load balancer does not function as expected in the test, the problem is a network bottleneck.

General tuning procedure:

Figure 4-3 shows the general tuning procedure.

Figure 4-3 General tuning procedure

4.2 General-Purpose Storage


4.2.1 Hardware Tuning

NVMe SSD Tuning

● Purpose

Reduce cross-chip data overheads.

● Procedure

Install the NVMe SSDs and NIC into the same riser card.

DIMM Installation Mode Tuning

● Purpose

Populate one DIMM per channel (1DPC) to maximize the memory performance. That is, populate the DIMM 0 slot of each channel.

● Procedure

Preferentially populate the DIMM 0 slots (DIMM 000, 010, 020, 030, 040, 050, 100, 110, 120, 130, 140, and 150). Of the three digits in the DIMM slot number, the first digit indicates the CPU, the second digit indicates the DIMM channel, and the third digit indicates the DIMM. Populate the DIMM slots whose third digit is 0 in ascending order.
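To confirm the 1DPC population after installation, the populated slots can be listed from the OS; a minimal check (the exact locator strings depend on the server's SMBIOS tables):

dmidecode -t memory | grep -E "Locator|Size" | grep -v "No Module Installed"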

4.2.2 System Tuning

Optimizing the OS Configuration

● Purpose

Adjust the system configuration to maximize the hardware performance.

● Procedure

Table 4-8 lists the optimization items.

Table 4-8 OS configuration parameters

Parameter Description Suggestion Configuration Method

vm.swappiness
The swap partition is the virtual memory of the system. Do not use the swap partition because it will deteriorate system performance.
Default value: 60. Symptom: The performance deteriorates significantly when the swap partition is used. Suggestion: Disable the swap partition and set this parameter to 0.
Configuration method: Run the following command:
sudo sysctl vm.swappiness=0


MTU
Maximum size of a data packet that can pass through a NIC. After the value is increased, the number of network packets can be reduced and the efficiency can be improved.
Default value: 1500 bytes. Symptom: Run the ip addr command to view the value. Suggestion: Set the maximum size of a data packet that can pass through a NIC to 9000 bytes.
Configuration method:
1. Run the following command and add MTU="9000" to the file:
vi /etc/sysconfig/network-scripts/ifcfg-${Interface}
NOTE: ${Interface} indicates the network port name.
2. After the configuration is complete, restart the network service.
service network restart

pid_max
The default value of pid_max is 32768, which is sufficient in normal cases. However, when heavy workloads are being processed, this value is insufficient and may cause memory allocation failure.
Default value: 32768. Symptom: Run the cat /proc/sys/kernel/pid_max command to view the value. Suggestion: Set the maximum number of threads that can be generated in the system to 4194303.
Configuration method: Run the following command:
echo 4194303 > /proc/sys/kernel/pid_max


file_max
Maximum number of files that can be opened by all processes in the system. In addition, some programs can call the setrlimit interface to set the limit on each process. If the system generates a large number of errors indicating that file handles are used up, increase the value of this parameter.
Default value: 13291808. Symptom: Run the cat /proc/sys/fs/file-max command to view the value. Suggestion: Set the maximum number of files that can be opened by all processes in the system to the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.
Configuration method: Run the following command:
echo ${file-max} > /proc/sys/fs/file-max
NOTE: ${file-max} is the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.


read_ahead
Linux readahead means that the Linux kernel prefetches a certain area of the specified file and loads it into the page cache, so that subsequent access to that area does not block on page faults. Reading data from memory is much faster than reading data from drives. Therefore, the readahead feature can effectively reduce the number of drive seeks and the I/O waiting time of the applications. It is one of the important methods for optimizing the drive read I/O performance.
Default value: 128 KB. Symptom: Run /sbin/blockdev --getra /dev/sdb to view the value. Suggestion: Change the value to 8192 to improve the drive read efficiency by prefetching data into random access memory (RAM).
Configuration method: Run the following command:
/sbin/blockdev --setra 8192 /dev/sdb
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.


I/O_Scheduler
The Linux I/O scheduler is a component of the Linux kernel. You can adjust the scheduler to optimize system performance.
Default value: CFQ. Symptom: The Linux I/O scheduler needs to be configured based on different storage devices for the optimal system performance. Suggestion: Set the I/O scheduling policy to deadline for HDDs and noop for SSDs.
Configuration method: Run the following command:
echo deadline > /sys/block/sdb/queue/scheduler
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.

nr_requests
If the Linux system receives a large number of read requests, the default number of request queues may be insufficient. To deal with this problem, you can dynamically adjust the default number of request queues in the /sys/block/hda/queue/nr_requests file.
Default value: 128. Symptom: Increase the drive throughput by adjusting the nr_requests parameter. Suggestion: Set the number of drive request queues to 512.
Configuration method: Run the following command:
echo 512 > /sys/block/sdb/queue/nr_requests
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.

Optimizing the Network Performance

● Purpose

This test uses the 25GE Ethernet adapter (Hi1822) with four ports, SFP+. It is used as an example to describe how to optimize the NIC parameters for the optimal performance.

● Procedure


The optimization methods include adjusting NIC parameters and interrupt-core binding (binding interrupts to the physical CPU of the NIC). Table 4-9 describes the optimization items.

Table 4-9 NIC parameters

Parameter Description Suggestion

irqbalance
System interrupt balancing service, which automatically allocates NIC software interrupts to idle CPUs.
Default value: active. Symptom: When this function is enabled, the system automatically allocates NIC software interrupts to idle CPUs. Suggestion:
● To disable irqbalance, set this parameter to inactive.
systemctl stop irqbalance
● Keep the function disabled after the server is restarted.
systemctl disable irqbalance

rx_buff
Aggregation of large network packets requires multiple discontinuous memory pages and causes low memory usage. You can increase the value of this parameter to improve the memory usage.
Default value: 2. Symptom: When the value is set to 2 by default, interrupts consume a large number of CPU resources. Suggestion: Load the rx_buff parameter and set the value to 8 to reduce discontinuous memory and improve memory usage and performance. For details, see the description following the table.

ring_buffer
You can increase the throughput by adjusting the NIC buffer size.
Default value: 1024. Symptom: Run the ethtool -g <NIC name> command to view the value. Suggestion: Change the ring_buffer queue size to 4096. For details, see the description following the table.


lro
LRO indicates large receive offload. After this function is enabled, multiple small packets are aggregated into one large packet for better efficiency.
Default value: off. Symptom: After this function is enabled, the maximum throughput increases significantly. Suggestion: Enable the large-receive-offload function to help networks improve the efficiency of sending and receiving packets. For details, see the description following the table.

hinicadm lro -i hinic0 -t <NUM>
Received aggregated packets are sent after the time specified by <NUM> (in microseconds). You can set the value to 256 microseconds for better efficiency.
Default value: 16 microseconds. Symptom: This parameter is used with the LRO function. Suggestion: Change the value to 256 microseconds.

hinicadm lro -i hinic0 -n <NUM>
Received aggregated packets are sent after the number of aggregated packets reaches the value specified by <NUM>. You can set the value to 32 for better efficiency.
Default value: 4. Symptom: This parameter is used with the LRO function. Suggestion: Change the value to 32.

– Adjusting rx_buff

i. Go to the /etc/modprobe.d directory.
   cd /etc/modprobe.d

ii. Create the hinic.conf file.
    vi /etc/modprobe.d/hinic.conf
    Add the following information to the file:
    options hinic rx_buff=8

iii. Reload the driver.
     rmmod hinic
     modprobe hinic

iv. Check whether the value of rx_buff is changed to 8.
    cat /sys/bus/pci/drivers/hinic/module/parameters/rx_buff

– Adjusting ring_buffer

i. Change the buffer size from the default value 1024 to 4096.
   ethtool -G <NIC name> rx 4096 tx 4096

ii. Check the current buffer size.


ethtool -g <NIC name>

– Enabling LRO

i. Enable the LRO function for a NIC.
   ethtool -K <NIC name> lro on

ii. Check whether the function is enabled.
    ethtool -k <NIC name> | grep large-receive-offload

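– Setting the LRO aggregation parameters (optional). The following two commands are a sketch reconstructed from the parameter names in Table 4-9; hinic0 is an example port name, and the exact syntax should be verified against the hinicadm help on your system before use.
  hinicadm lro -i hinic0 -t 256        # send aggregated packets after 256 microseconds
  hinicadm lro -i hinic0 -n 32         # or after 32 packets have been aggregated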
NOTE

In addition to optimizing the preceding parameters, you need to bind the NIC software interrupts to the cores.

1. Disable the irqbalance service.

2. Query the NUMA node to which the NIC belongs:
   cat /sys/class/net/<Network port name>/device/numa_node

3. Query the CPU cores that correspond to the NUMA node.
   lscpu

4. Query the interrupt IDs corresponding to the NIC.
   cat /proc/interrupts | grep <Network port name> | awk -F ':' '{print $1}'

5. Bind the software interrupts to the cores corresponding to the NUMA node (see the sketch after these steps).
   echo <core number> > /proc/irq/<Interrupt ID>/smp_affinity_list

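The following is a minimal sketch of steps 1 to 5 for a single NIC port. The port name enp125s0f0, the assumption that it belongs to NUMA node 0, and the 24-core node size are hypothetical values; substitute the results returned on your own system.

systemctl stop irqbalance                                                   # step 1
cat /sys/class/net/enp125s0f0/device/numa_node                              # step 2: assume it prints 0
lscpu | grep "NUMA node0"                                                   # step 3: assume cores 0-23 belong to node 0
irqs=$(cat /proc/interrupts | grep enp125s0f0 | awk -F ':' '{print $1}')    # step 4
core=0
for irq in $irqs; do                                                        # step 5: spread the interrupts over the node-0 cores
    echo $core > /proc/irq/$irq/smp_affinity_list
    core=$(( (core + 1) % 24 ))
done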
4.2.3 Ceph Tuning

Modifying Ceph Configuration
● Purpose

Adjust the Ceph configuration to maximize system resource usage.
● Procedure

You can edit the /etc/ceph/ceph.conf file to modify all Ceph configuration parameters. For example, you can add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file to change the default number of copies to 4, and then run the systemctl restart ceph.target command to restart the Ceph daemon process for the change to take effect.
The preceding operations take effect only on the current Ceph node. You need to modify the ceph.conf file on all Ceph nodes and restart the Ceph daemon process for the modification to take effect on the entire Ceph cluster. A minimal workflow sketch follows this paragraph. Table 4-10 describes the Ceph optimization items.

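The following sketch shows that per-node workflow, assuming a cluster whose other storage nodes are reachable as node2 and node3 (hypothetical host names; replace them with your own):

vi /etc/ceph/ceph.conf                  # add, for example, osd_pool_default_size = 4 under [global]
systemctl restart ceph.target           # restart the Ceph daemons on this node
# Repeat on every other Ceph node, for example by distributing the file first:
scp /etc/ceph/ceph.conf node2:/etc/ceph/ceph.conf && ssh node2 systemctl restart ceph.target
scp /etc/ceph/ceph.conf node3:/etc/ceph/ceph.conf && ssh node3 systemctl restart ceph.target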
Table 4-10 Ceph parameter configuration

[global]

● cluster_network: You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network. Recommended value: 192.168.4.0/24; you can set this parameter as required as long as it is different from the public network segment.


● public_network: Recommended value: 192.168.3.0/24; you can set this parameter as required as long as it is different from the cluster network segment.

● osd_pool_default_size: Number of copies. Recommended value: 3.

● osd_memory_target: Size of memory that each OSD process is allowed to obtain. Recommended value: 4294967296.

For details about how to optimize other parameters, see Table 4-11.

Table 4-11 Other parameter configuration

[global]

● osd_pool_default_min_size: Minimum number of I/O copies that the PG can receive. If a PG is in the degraded state, its I/O capability is not affected. Default value: 0; recommended value: 1.

● cluster_network: You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network. This parameter has no default value; recommended value: 192.168.4.0/24.

● osd_memory_target: Size of memory that each OSD process is allowed to obtain. Default value: 4294967296; recommended value: 4294967296.

[mon]

● mon_clock_drift_allowed: Clock drift allowed between MONs. Default value: 0.05; recommended value: 1.

● mon_osd_min_down_reporters: Minimum number of OSDs that must report another OSD as down before the MONs mark it down. Default value: 2; recommended value: 13.


● mon_osd_down_out_interval: Number of seconds that Ceph waits before an OSD is marked as down or out. Default value: 600; recommended value: 600.

[OSD]

● osd_journal_size: OSD journal size. Default value: 5120; recommended value: 20000.

● osd_max_write_size: Maximum size (in MB) of data that can be written by an OSD at a time. Default value: 90; recommended value: 512.

● osd_client_message_size_cap: Maximum size (in bytes) of client data that can be stored in the memory. Default value: 100; recommended value: 2147483648.

● osd_deep_scrub_stride: Number of bytes that can be read during deep scrubbing. Default value: 524288; recommended value: 131072.

● osd_map_cache_size: Size of the cache (in MB) that stores the OSD map. Default value: 50; recommended value: 1024.

● osd_recovery_op_priority: Restoration priority. The value ranges from 1 to 63. A larger value indicates higher resource usage. Default value: 3; recommended value: 2.

● osd_recovery_max_active: Number of active restoration requests in the same period. Default value: 3; recommended value: 10.

● osd_max_backfills: Maximum number of backfills allowed by an OSD. Default value: 1; recommended value: 4.

● osd_min_pg_log_entries: Minimum number of reserved PG logs. Default value: 3000; recommended value: 30000.

● osd_max_pg_log_entries: Maximum number of reserved PG logs. Default value: 3000; recommended value: 100000.

● osd_mon_heartbeat_interval: Interval (in seconds) for an OSD to ping a MON. Default value: 30; recommended value: 40.


● ms_dispatch_throttle_bytes: Maximum size (in bytes) of messages waiting to be dispatched. Default value: 104857600; recommended value: 1048576000.

● objecter_inflight_ops: Maximum number of unsent I/O requests allowed. This parameter is used for client traffic control. If the number of unsent I/O requests exceeds the threshold, the application I/O is blocked. The value 0 indicates that the number of unsent I/O requests is not limited. Default value: 1024; recommended value: 819200.

● osd_op_log_threshold: Number of operation logs to be displayed at a time. Default value: 5; recommended value: 50.

● osd_crush_chooseleaf_type: Bucket type used when the CRUSH rule uses chooseleaf. Default value: 1; recommended value: 0.

● journal_max_write_bytes: Maximum number of journal bytes that can be written at a time. Default value: 10485760; recommended value: 1073714824.

● journal_max_write_entries: Maximum number of journal records that can be written at a time. Default value: 100; recommended value: 10000.

[Client]

● rbd_cache: RBD cache. Default value: True; recommended value: True.

● rbd_cache_size: RBD cache size (in bytes). Default value: 33554432; recommended value: 335544320.

● rbd_cache_max_dirty: Maximum number of dirty bytes allowed when the cache is in writeback mode. If the value is 0, the cache works in writethrough mode. Default value: 25165824; recommended value: 134217728.


● rbd_cache_max_dirty_age: Duration (in seconds) for which dirty data is kept in the cache before being flushed to the drives. Default value: 1; recommended value: 30.

● rbd_cache_writethrough_until_flush: This parameter is used for compatibility with virtio drivers earlier than linux-2.6.32. It prevents data from being written back when no flush request has been sent. When this parameter is set, librbd processes I/Os in writethrough mode and switches to writeback mode only after the first flush request is received. Default value: True; recommended value: False.

● rbd_cache_max_dirty_object: Maximum number of cached objects. The default value is 0, which indicates that the number is calculated based on the RBD cache size. By default, librbd logically splits the drive image into 4 MB chunks, and each chunk is abstracted as an object managed by the librbd cache. You can increase the value of this parameter to improve the performance. Default value: 0; recommended value: 2.

● rbd_cache_target_dirty: Dirty data size that triggers writeback. The value cannot exceed the value of rbd_cache_max_dirty. Default value: 16777216; recommended value: 235544320.

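The tuning items above can be collected into a single ceph.conf fragment. The following sketch picks a representative subset of the recommended values from Table 4-10 and Table 4-11; the network segments are the example values from the tables and must match the actual networks in your environment.

[global]
public_network = 192.168.3.0/24
cluster_network = 192.168.4.0/24
osd_pool_default_size = 3
osd_pool_default_min_size = 1
osd_memory_target = 4294967296

[mon]
mon_clock_drift_allowed = 1
mon_osd_min_down_reporters = 13
mon_osd_down_out_interval = 600

[osd]
osd_max_write_size = 512
osd_recovery_max_active = 10
osd_max_backfills = 4

[client]
rbd_cache = true
rbd_cache_size = 335544320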
Optimizing the PG Distribution
● Purpose

Adjust the number of PGs on each OSD to balance the load on each OSD.


● Procedure
By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs of an existing storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool.
The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. You can run the ceph balancer on or ceph balancer off command to enable or disable the Ceph balancer function.
Table 4-12 describes the PG distribution parameters.

Table 4-12 PG distribution parameters

● pg_num: Total PGs = (Total number of OSDs x 100) / max_replication_count. Round up the result to the nearest integer power of 2. Default value: 8. Symptom: a warning is displayed if the number of PGs is insufficient. Suggestion: calculate the value based on the formula (see the calculation sketch after this table).

● pgp_num: Set the number of PGPs to be the same as that of PGs. Default value: 8. Symptom: it is recommended that the number of PGPs be the same as the number of PGs. Suggestion: calculate the value based on the formula.

● ceph_balancer_mode: Enable the balancer plug-in and set the plug-in mode to upmap. Default value: none. Symptom: if the number of PGs is unbalanced, some OSDs may be overloaded and become bottlenecks. Recommended value: upmap.

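A minimal sketch of the pg_num calculation, assuming a hypothetical cluster with 36 OSDs and 3 replicas (replace both numbers and the pool name with your own):

osds=36; replicas=3                               # assumed cluster size
raw=$(( osds * 100 / replicas ))                  # (36 x 100) / 3 = 1200
pg_num=1
while [ $pg_num -lt $raw ]; do pg_num=$(( pg_num * 2 )); done   # round up to a power of 2
echo $pg_num                                      # prints 2048
ceph osd pool create testpool $pg_num $pg_num     # testpool is a hypothetical pool name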

NOTE

● The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.

● Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.

● The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs corresponding to each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.

Binding OSDs to CPU Cores
● Purpose

Bind each OSD process to a fixed CPU core.
● Procedure

Add osd_numa_node = <NUM> to the /etc/ceph/ceph.conf file.
Table 4-13 describes the optimization items.

Table 4-13 OSD core binding parameters

[osd.n]

● osd_numa_node: Bind the osd.n daemon process to a specified idle NUMA node, which is a node other than the nodes that process the NIC software interrupts. This parameter has no default value. Symptom: if the CPU of each OSD process is the same as that of the NIC interrupts, some CPUs may be overloaded. Suggestion: to balance the CPU load pressure, avoid running each OSD process and the NIC interrupt process (or other processes with high CPU usage) on the same NUMA node.


NOTE

● The Ceph OSD daemon process and NIC software interrupt process must run on different NUMA nodes. Otherwise, CPU bottlenecks may occur when the network load is heavy. By default, Ceph evenly allocates OSD processes to all CPU cores. You can add the osd_numa_node parameter to the ceph.conf file to avoid running each OSD process and the NIC interrupt process (or other processes with high CPU usage) on the same NUMA node.

● Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU cores of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the CPU cores bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC. For example, run the cat /sys/class/net/PortName/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 to prevent the OSD and the NIC software interrupts from using the same CPU cores (see the sketch after this note).

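The following sketch walks through that example, assuming a hypothetical NIC port enp125s0f0 on NUMA node 2 and two OSD daemons named osd.0 and osd.1 (replace the port name, node numbers, and OSD IDs with your own):

cat /sys/class/net/enp125s0f0/device/numa_node    # assume it prints 2, so keep the OSDs off node 2

# In /etc/ceph/ceph.conf, pin each OSD to one of the remaining nodes:
# [osd.0]
# osd_numa_node = 0
# [osd.1]
# osd_numa_node = 1

systemctl restart ceph-osd.target                 # restart the OSDs for the change to take effect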
Optimizing Compression Algorithm Configuration Parameters
● Purpose

Adjust the compression algorithm configuration parameters to optimize the performance of the compression algorithm.

● Procedure
The default value of bluestore_min_alloc_size_hdd for Ceph is 32 KB. The value of this parameter affects the size of the final data obtained after the compression algorithm is run. Set this parameter to a smaller value to maximize the compression rate of the compression algorithm.
By default, Ceph uses five threads to process I/O requests in an OSD process. After the compression algorithm is enabled, the number of threads can become a performance bottleneck. Increase the number of threads to maximize the performance of the compression algorithm.
The following table describes the compression-related parameters:

● bluestore_min_alloc_size_hdd: Minimum size of objects allocated to the HDD data disks in the BlueStore storage engine. Default value: 32768; recommended value: 8192.

● osd_op_num_shards_hdd: Number of shards for an HDD data disk in an OSD process. Default value: 5; recommended value: 12.

● osd_op_num_threads_per_shard_hdd: Average number of threads of an OSD process for each HDD data disk shard. Default value: 1; recommended value: 2.

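A sketch of the corresponding ceph.conf fragment with the recommended values (adjust them to your workload, then restart the OSD daemons for the change to take effect):

[osd]
bluestore_min_alloc_size_hdd = 8192
osd_op_num_shards_hdd = 12
osd_op_num_threads_per_shard_hdd = 2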

Enabling Bcache
Bcache is a block layer cache of the Linux kernel. It uses SSDs as the cache of HDDs for acceleration. To enable the Bcache kernel module, you need to recompile the kernel. For details, see the Bcache User Guide (CentOS 7.6).

Using the I/O Passthrough Tool
The I/O passthrough tool is a process optimization tool for balanced scenarios of the Ceph cluster. It can automatically detect and optimize OSDs in the Ceph cluster. For details on how to use this tool, see the I/O Passthrough Tool User Guide.

4.2.4 KAE zlib Compression Tuning
● Purpose

Optimize zlib compression to maximize the CPU capability of processing OSDs and maximize the hardware performance.

● Procedure
zlib compression is processed by the KAE.

Preparing the Environment
NOTE

Before installing the accelerator engine, you need to apply for and install a license.
License application guide: https://support.huawei.com/enterprise/zh/doc/EDOC1100068122/b9878159
Installation guide: https://support.huawei.com/enterprise/en/doc/EDOC1100048786/ba20dd15

Download the acceleration engine installation package and developer guide.

Download link: https://github.com/kunpengcompute/KAE/tags

Installing the Acceleration Engine
NOTE

The developer guide describes how to install and use all modules of the accelerator engine. Select an appropriate installation mode based on the developer guide. For details, see Installing the KAE Software Package Using Source Code.

Step 1 Install the acceleration engine according to the developer guide.

Step 2 Install the zlib library.

1. Download KAEzip.
2. Download zlib-1.2.11.tar.gz from the zlib official website and copy it to KAEzip/open_source.
3. Perform the compilation and installation.
   cd KAEzip
   sh setup.sh install
   The zlib library is installed in /usr/local/kaezip.


Step 3 Back up the existing library link.
mv /lib64/libz.so.1 /lib64/libz.so.1-bak

Step 4 Replace the zlib software compression algorithm dynamic library.
cd /usr/local/kaezip/lib
cp libz.so.1.2.11 /lib64/
mv /lib64/libz.so.1 /lib64/libz.so.1-bak
ln -s /lib64/libz.so.1.2.11 /lib64/libz.so.1

NOTE
In the cd /usr/local/kaezip/lib command, /usr/local/kaezip indicates the zlib installation path. Change it as required.

----End

NOTE

If the Ceph cluster is running before the dynamic library is replaced, run the following command on all storage nodes to restart the OSDs for the change to take effect after the replacement:
systemctl restart ceph-osd.target

Changing the Default Number of Accelerator Queues
NOTE

The default number of queues of the hardware accelerator is 256. To fully utilize the performance of the accelerator, change the number of queues to 512 or 1024.

Step 1 Remove hisi_zip.
rmmod hisi_zip

Step 2 Set the default accelerator queue parameter pf_q_num=512.
vi /etc/modprobe.d/hisi_zip.conf
Add the following line to the file:
options hisi_zip uacce_mode=2 pf_q_num=512

Step 3 Load hisi_zip.
modprobe hisi_zip

Step 4 Check the hardware accelerator queues.
cat /sys/class/uacce/hisi_zip-*/attrs/available_instances
The change is successful if the command output reflects the configured queue quantity.

Step 5 Check the dynamic library links. If libwd.so.1 is contained in the command output, the operation is successful.
ldd /lib64/libz.so.1

----End


Adapting Ceph to the Accelerator
NOTE

Currently, the mainline Ceph versions allow configuring the zlib compression mode using the configuration file. The released Ceph versions (up to v15.2.3) adopt the zlib compression mode without the data header and tail. However, the current hardware acceleration library supports only the mode with the data header and tail. Therefore, the Ceph source code needs to be modified to adapt to the Kunpeng hardware acceleration library. For details about the modification method, see the latest patch that has been incorporated into the mainline version:

https://github.com/ceph/ceph/pull/34852

The following uses Ceph 14.2.11 as an example to describe how Ceph adapts to the zlib compression engine.

Step 1 Obtain the source code.

Source code download address: https://download.ceph.com/tarballs/

After the source code package is downloaded, save it to the /home directory on the server.

Step 2 Obtain the patch and save it to the /home directory.

https://github.com/kunpengcompute/ceph/releases/download/v14.2.11/ceph-14.2.11-glz.patch

Step 3 Go to the /home directory, decompress the source code package, and enter the directory generated after decompression.
cd /home && tar -zxvf ceph-14.2.11.tar.gz && cd ceph-14.2.11/

Step 4 Apply the patch in the root directory of the source code.
cd /home/ceph-14.2.11
patch -p1 < ceph-14.2.11-glz.patch

Step 5 After modifying the source code, compile Ceph.

● CentOS: See Ceph 14.2.1 Porting Guide (CentOS 7.6).

● openEuler: See Ceph 14.2.8 Porting Guide (openEuler 20.03).

Step 6 Install Ceph.

Step 7 Modify the ceph.conf file to configure the zlib compression mode.
vi /etc/ceph/ceph.conf
Add the following line:
compressor_zlib_winsize=15

Step 8 Restart the Ceph cluster for the configuration to take effect, and then verify the setting on an OSD:
ceph daemon osd.0 config show | grep compressor_zlib_winsize


----End


A Change History

Date          Description

2021-09-13    This issue is the tenth official release. Added 1 Using the Kunpeng Hyper Tuner for Tuning.

2021-07-14    This issue is the ninth official release. Added the adaptation of Ceph storage tuning to openEuler 20.03 LTS SP1.

2021-06-25    This issue is the eighth official release. Changed the processor model from "Kunpeng 920 5230" to "Kunpeng 920 5220."

2021-05-26    This issue is the seventh official release.
              ● Changed zlib hardware acceleration to KAE zlib compression.
              ● Changed MD5 hardware acceleration to the KAE MD5 digest algorithm.

2021-03-23    This issue is the sixth official release. Changed the solution name from "Kunpeng SDS solution" to "Kunpeng BoostKit for SDS."

2021-01-19    This issue is the fifth official release.
              ● Modified the Adapting Ceph to the Accelerator operation procedure in the 3 Ceph Object Storage Tuning Guide.
              ● Added the reference to the I/O Passthrough Tool User Guide to the I/O passthrough tool tuning.

2020-09-27    This issue is the fourth official release. Added information about the I/O passthrough tool.


2020-06-29    This issue is the third official release.
              ● Modified the Bcache-related reference.
              ● Added 3.4.3 KAE MD5 Digest Algorithm Tuning in the 3 Ceph Object Storage Tuning Guide.

2020-05-09    This issue is the second official release. Modified figures in the documents.

2020-03-20    This issue is the first official release.
