Capacity Management in Cisco HyperFlex
White paper
Cisco public
Contents
Cisco HyperFlex HX Data Platform overview
Cisco HyperFlex HX Data Platform all-NVMe storage
Cisco HyperFlex architecture
HyperFlex replication factor
Data striping across nodes and drives in HyperFlex
Cisco Intersight
HyperFlex (HX) space views
Virtual machine space view
HX Connect space views
Additional space consumption
vCenter space views
Understanding HX cleaner and its process
Space accounting for delayed fashion writes and deletes
Best practices
Space accounting for snapshots
VMware vStorage API for Array Integration (VAAI)
Different cases of failure modes
Space usage’s effect on performance
Handling “Out of Space” errors
Cluster expansion
Summary
HyperFlex storage capacity forecasting and trending
Cisco HyperFlex HX Data Platform overview
Cisco HyperFlex™ HX Data Platform (HX Data Platform) is a hyperconverged software appliance that transforms
Cisco® servers into a single pool of compute and storage resources. It eliminates the need for network storage
and enables seamless interoperability between computing and storage in virtual environments. The HX Data
Platform provides a highly fault-tolerant distributed storage system that preserves data integrity and optimizes
performance for Virtual Machine (VM) storage workloads. In addition, native compression and deduplication
reduce storage space occupied by the VMs and VM workloads.
Cisco HyperFlex HX Data Platform all-NVMe storage
Cisco HyperFlex systems are designed with an end-to-end software-defined infrastructure that eliminates the
compromises found in first-generation products. With all-Non-Volatile Memory Express (NVMe) storage configurations and a choice of management tools, Cisco HyperFlex systems deliver a tightly integrated
cluster that is up and running in less than an hour and that scales resources independently to closely match
your application requirements. An all-NVMe storage solution delivers more of what you need to propel mission-
critical workloads. For a simulated Oracle OLTP workload, it provides 71 percent more I/O Operations Per
Second (IOPS) and 37 percent lower latency than our previous-generation all-flash node. A holistic system
approach is used to integrate HX Data Platform software with Cisco HyperFlex HX220c M5 all NVMe Nodes.
The result is the first fully engineered hyperconverged appliance based on NVMe storage. All-NVMe solutions
support most latency-sensitive applications with the simplicity of hyperconvergence.
Cisco HyperFlex architecture
In Cisco HyperFlex Systems, the data platform spans three or more Cisco HyperFlex HX-Series nodes to create a highly available cluster (a two-node edge cluster is the exception). Each node includes a Cisco HyperFlex HX Data Platform controller that implements the distributed file system, using internal flash-based SSD drives for all-flash clusters, internal NVMe SSD drives for all-NVMe clusters, and high-capacity HDDs for hybrid clusters to store data. The controllers communicate with each other over 10 Gigabit Ethernet to present a single pool of storage that spans the nodes in the cluster (see Figure 1). Nodes access data through a data layer using file, block, object, and API plug-ins. As nodes are added, the cluster scales linearly to deliver computing, storage capacity, and I/O performance. This hyperconverged solution integrates servers, storage systems, network resources, and storage software to provide an enterprise-scale environment for application deployment. This highly integrated environment provides reliability, high availability, scalability, and performance to handle large-scale transactional workloads.
Note: Although the testing described here was performed using Cisco UCS 6200 Series Fabric
Interconnects, the Cisco HyperFlex HX Data Platform does include support for Cisco UCS 6300 Series
Fabric Interconnects, which provide higher performance with 40 Gigabit Ethernet.
Figure 1.
Overview of a Cisco HyperFlex architecture
HyperFlex replication factor
The Replication Factor (RF) is the number of copies of each storage block that the HyperFlex system writes. There are two options: an RF of 2 or an RF of 3. The default setting during installation of the HyperFlex system is 3. Blocks are placed by an algorithm that ensures no two copies of a block are written to the same node or the same disk, which eliminates a single point of failure.
Data striping across nodes and drives in HyperFlex
The data platform implements a log-structured file system that uses a caching layer to accelerate read requests
and write responses, and it implements a capacity layer for persistent storage. The capacity layer consists of HDD drives (in hybrid nodes), SSD drives (in all-flash nodes), or NVMe storage (in all-NVMe nodes).
Incoming data is striped across the number of nodes that you define to meet your data availability requirements.
The log-structured file system assembles blocks to be written to a configurable cache until the buffer is full or
workload conditions dictate that it be de-staged to the capacity layer. When existing data is (logically)
overwritten, the log-structured approach simply appends a new block and updates the metadata. When data is
de-staged, the write operation consists of a single seek operation with a large amount of data written. This
approach improves performance significantly compared to the traditional read-modify-write model, which is
characterized by numerous seek operations and small amounts of data written at a time.
Write operations are compressed as they are written to the Write Log (WL), and the data is deduplicated when it is de-staged to the capacity layer in each node. This process occurs after the write operation is acknowledged, so there is no performance penalty for these operations. A small deduplication block size helps increase the deduplication rate. Compression further reduces the data footprint. Data is then moved to the capacity tier as cache segments become free. Read operations in hybrid nodes cache data on the SSD drives and in main memory for high performance. In all-flash and all-NVMe nodes, read operations are served directly from storage. Having the most frequently used data stored in the caching layer helps make Cisco HyperFlex Systems perform well for virtualized applications. When virtual machines modify data, the original block is likely read from the cache, so there is often no need to read and then expand the data on a spinning disk. The data platform decouples the caching tier from the capacity tier and allows independent scaling of I/O performance and storage capacity.
Cisco Intersight
Cisco Intersight™ is a software management platform delivered as a service with embedded analytics for your
Cisco and third-party IT infrastructure. This platform offers an intelligent level of management that enables IT
organizations to analyze, simplify, and automate their environments in more advanced ways than with prior
generations of tools. Cisco Intersight provides an integrated and intuitive management experience for resources
in the traditional data center and at the edge. With flexible deployment options to address complex security
needs, getting started with Intersight is quick and easy. Cisco Intersight has deep integration with Cisco UCS®
and HyperFlex systems, allowing for remote deployment, configuration, and ongoing maintenance. The model-
based deployment works for a single system in a remote location or with hundreds of systems in a data center,
and enables rapid, standardized configuration and deployment. It also streamlines maintaining those systems,
whether you are working with small or large configurations.
Cisco Intersight offers flexible deployment either as Software as a Service (SaaS) on Intersight.com or running
on your premises as Cisco Intersight Virtual Appliance. The virtual appliance provides the benefits of Cisco
Intersight, while allowing more flexibility for those with additional data locality and security requirements. To
learn more about Cisco Intersight features and supported systems, see the Cisco Intersight Getting Started
Guide.
Different views of space
HyperFlex (HX) space views
● Raw - This is the actual total physical capacity available for use. Per node, this equates to multiplying the
number of capacity drives by the drive’s capacity.
● Usable - This is the total available space after deducting RF and metadata overhead from the raw
capacity. Typically, the metadata requirement equates to 7% of capacity or 0.93 as a multiplier.
● Effective - This is essentially the Usable storage adjusted by the combined storage optimization gain from compression and deduplication.
● Storage Optimization Gain – This is best explained using an example. If the anticipated gain from compression is 30%, that equates to 0.3 (30/100), and if the anticipated gain from deduplication is 10%, that equates to 0.1 (10/100). To calculate the overall gain, we use the following calculation: (1 - compression gain) * (1 - deduplication gain) = (1 - 0.3) * (1 - 0.1) = 0.63.
● Storage Operating Threshold - This is the safe contingency or overhead of storage capacity needed to
ensure the system will run optimally during normal operating conditions.
◦ The threshold is typically set at 70 percent storage utilization, or 0.7 as a multiplier.
● The overall calculation to determine the effective storage per HX node is therefore:
◦ Raw (#Disks * Capacity of Disks) / Replication Factor (2 or 3) * 0.93 (Metadata) * 0.7 (Storage
Operating Threshold) / Storage Optimization Gain (0.63 in above example).
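As a worked sketch of this formula (the values are illustrative assumptions: eight 960 GB capacity drives per node, RF3, and the 0.63 optimization gain from the example above), the arithmetic can be checked from any shell:
# Effective capacity per node = raw / RF * 0.93 (metadata) * 0.7 (operating threshold) / 0.63 (optimization gain)
awk 'BEGIN { raw = 8 * 960;        # assumed raw capacity per node, in GB
             printf "effective capacity ~= %.0f GB per node\n", raw / 3 * 0.93 * 0.7 / 0.63 }'
# prints: effective capacity ~= 2645 GB per node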
Virtual machine space view
● Provisioned - This is the actual logical size of the virtual machine, which is thin-provisioned.
● Used - This is the actual physical space used by the virtual machine (application level) before applying
any kind of savings such as deduplication and compression.
HX Connect space views
The HX Connect user interface provides a view of the HX storage cluster status, components, and features,
such as encryption and replication. In the HX Connect UI, there are multiple storage views, with summaries
about the space utilization in the HX cluster. The concept of capacity is applied to both datastores and storage
clusters. As storage subsystems require base-2 math, physical raw capacity values are measured in base-2
(GiB/TiB).
Note: Usually one source of confusion is that when the drive vendors say 1 TB, what they mean is one
terabyte is equal to 10^12, or 1,000,000,000,000 bytes but the storage subsystems require base-2
math. If you convert TB to TiB, there will be about a 10-percent difference between the size of a
tebibyte and a terabyte. This is significant when talking about storage capacity because what storage
subsystems are looking for is 1 TiB, meaning one tebibyte is equal to 2^40, or 1,099,511,627,776 bytes.
A tebibyte equals nearly 1.1 TB.
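As a quick check of this conversion, the following one-liner (a sketch using awk for the arithmetic) shows the roughly 9 to 10 percent gap between the two units:
# 1 TB = 10^12 bytes; 1 TiB = 2^40 bytes
awk 'BEGIN { printf "1 TB = %.3f TiB, 1 TiB = %.3f TB\n", 1e12 / 2^40, 2^40 / 1e12 }'
# prints: 1 TB = 0.909 TiB, 1 TiB = 1.100 TB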
1. HX Connect dashboard overview
The HX Connect dashboard page displays a status summary of your HX storage cluster. This is the first page that you see when you log in to Cisco HyperFlex Connect. The display units are labelled as TB, but they are really measured as TiB to match reporting in vCenter. The following view shows the HX Connect dashboard.
HX Connect dashboard view:
The Capacity section in the HX Connect dashboard page displays a breakdown of the total storage versus how
much storage is used or free. It also displays the storage optimization, compression savings, and deduplication
percentages based on the data stored in the cluster. Capacity, used space and storage optimization are defined
as:
● Capacity (usable space) - The capacity value displayed in the dashboard view isn’t the raw physical capacity. The Capacity section in the HX Connect dashboard page provides the overall usable space that is available after all the deductibles, which include the unit conversion overhead (9%), metadata overhead (8%), and resiliency (the space consumed for data protection, which depends on the replication factor that is set). This is (raw - (conversion overhead) - (RF overhead) - (metadata overhead)).
● Used space - This is the total actual physical space used *after* deduplication and compression. An
important factor to be noted is that this counter includes additional metadata and “garbage” data that is
cleaned in a delayed fashion. Also, the capacity counter is updated in a delayed fashion as the data gets
flushed from the write log to the persistent tier. In order to better understand how this delayed fashion
works, additional details about write log, persistent tier, and how the data is being flushed are explained
in later sections of this guide.
● Storage optimization – This includes overall space efficiency, which is calculated as (1 - (1-%dedupe
savings) * (1 - %Compression savings)).
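The following sketch ties these dashboard values together; the node count, drive sizes, and savings percentages are illustrative assumptions, not values from a real cluster:
# Usable: raw converted to TiB (the ~9% unit conversion), reduced by ~8% metadata, divided by RF3
# Storage optimization: 1 - (1 - %dedupe savings) * (1 - %compression savings)
awk 'BEGIN { raw_bytes  = 4 * 8 * 960e9;     # assumed: 4 nodes x 8 x 960 GB drives
             usable_tib = raw_bytes / 2^40 * 0.92 / 3;
             opt        = 1 - (1 - 0.10) * (1 - 0.30);
             printf "usable ~= %.1f TiB, storage optimization ~= %.0f%%\n", usable_tib, opt * 100 }'
# prints: usable ~= 8.6 TiB, storage optimization ~= 37%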
2. HX Connect datastore overview
HX Connect datastore view:
● Size – This is the logical size of the datastore, which is thin-provisioned.
● Used – This is the used space shown in the datastore view above and is the actual physical space used *after* applying inline compression and deduplication. The capacity counter is updated in a delayed
fashion as the data gets flushed from the write log to the persistent tier.
Note: As the provisioned/used/free capacity of each datastore is based on datastore (thin) provisioned
capacity, the provisioned size of a datastore (specified by the administrator while creating the datastore)
can be higher than the actual storage size.
3. HX Connect virtual machine overview
HX Connect virtual machine view:
● Storage provisioned – This is the logical size of the virtual machine (shown in the screen capture
above), which is thin-provisioned. However, the virtual machine can also be thick-provisioned.
● Storage used - The storage used displayed in the virtual machine view image above is the actual
physical space used by the virtual machine *after* inline compression. The capacity counter is updated in
a delayed fashion as the data gets flushed from the write log to the persistent tier.
Note: In the case of a VM being thick-provisioned, the storage used would be displayed as the same
amount as the storage provisioned. In the image above, the storage used will be 50 GB if the VM is thick-
provisioned.
Additional space consumption
The “used” space shown in the HX Connect dashboard (Capacity section) page includes additional metadata
and *garbage*. Due to this extra space consumption, the used space displayed in the HX Connect datastore
and virtual machine view doesn’t match with what the overall cluster capacity view shows.
● Datastore view ∑used space = virtual machine view ∑used space
● Cluster view ∑used space (includes garbage) ≠ datastore view ∑used space and virtual machine
view ∑used space
A Log-Structured File System (LFS) gives a completely different approach to managing a file system. The
central idea behind LFS is that blocks are never modified; whenever an operation conceptually modifies a file,
the operation places a new block at the end of the log and the old version of the data block becomes “garbage”
and then the file system needs to go back and perform garbage cleaning. Writes always go to the end of the log. The reason for choosing LFS over write-in-place file systems, even though it generates additional garbage, is that LFS makes snapshots and backups inexpensive. A major disadvantage of write-in-place file systems is that performance degrades dramatically under their read-modify-write pattern of many seeks and small writes. Therefore, the cluster’s used space in the dashboard capacity view includes garbage generated by LFS.
vCenter space views
1. Virtual machine overview
vCenter virtual machine view:
● Provisioned space – This is the logical size of the virtual machine, which is thin-provisioned.
● Used space – This is the used space shown in the virtual machine view directly above, and is the actual
physical space used *after* inline compression.
Note: If the virtual machine is thick-provisioned, the values displayed in the VMware vCenter and HX Connect views differ from the thin-provisioned case. From the screen capture above, if the VM is thick-provisioned, the used space will be displayed as 50 GB (the same as the provisioned space). The capacity-related values displayed in the HX Connect view match what the vCenter view displays.
2. Datastore overview
VMware vCenter datastore view:
● Capacity – This is the logical size of the datastore, which is, by default, thin-provisioned.
● Used - This is the actual physical space used after inline deduplication and compression.
● Free – This is the difference between the capacity (thin-provisioned capacity) and used space of the
datastore.
Understanding HX cleaner and its process
The cleaner typically runs in the background continuously. The cleaner goes into sleep mode when it is not
needed and wakes when policy-defined conditions are met. These conditions include:
● Number of flushes since the last cleaner run
● Amount of data written since the last cleaner run
● Amount of data deleted since the last cleaner run
● Storage cluster space usage (see capacity status)
Priority levels:
● Normal priority - The cleaner generates minimum I/O
● High priority - The cleaner I/O is increased. This kicks in when cluster capacity utilization nears the 76 percent threshold
If your storage cluster is experiencing an ENOSPC condition, the cleaner automatically runs at high priority. Priority is determined by cluster space usage. If the storage cluster reaches an ENOSPC WARN condition, the cleaner increases its intensity by increasing the number of I/O operations to collect garbage. With an ENOSPC SET condition, it runs at the highest priority.
Note: Deleting data alone does not trigger the cleaner to run and recover space. The policy conditions and settings described above govern when the cleaner triggers. The cleaner runs aggressively once cluster utilization is above 76 percent. Because the platform runs on a log-structured file system, datastore-level and cluster-level utilization can differ until dead data is cleaned up; the cleaner reclaims that space based on the policy thresholds. If you delete a substantial portion of a large object, the cleaner kicks off and starts proactively cleaning data, but the effect is delayed.
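To check what the cleaner is doing, you can query it from the storage controller VM. The command below is a sketch based on the stcli cleaner subcommand available in recent HX Data Platform releases; the exact syntax can vary by release, so verify it against the CLI reference:
# Run on the storage controller VM (stCtlVM); reports the cleaner's current state and statistics
stcli cleaner info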
Space accounting for delayed fashion writes and deletes
The space views shown in the HX Connect UI are from the persistent-tier storage. From the HX standpoint, accounting follows a delayed model: there is a write log (which is substantial in all-flash nodes, about 60 GB per node) and a persistent tier. Whenever there is a write or delete operation, it first sits in the write log and doesn’t enter the persistent tier until the write log space is fully occupied. Therefore, you will not see the used space go down for a long period of time after a deletion, and similarly, the used space will not grow immediately when data is written to HX storage.
Example: In a 4-node all-flash HX cluster, the total write-log size would be 240 GB. Consider a virtual machine with provisioned space of 1 TB where the actual used space is only 200 GB (meaning 200 GB of data is written to the write log). The HX counters are not updated until the write log exceeds its maximum capacity, so the data sits in the write log rather than the persistent tier. Similarly, for deletion, the used space is not going to update until data is actually deleted from persistent storage. Also, for deletion, the cleaner will not kick off if the deleted amount is tiny, if there are not a lot of writes happening in the system, and if the space utilization is low enough.
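A small sketch of the arithmetic in this example (the per-node write-log size of roughly 60 GB is the figure quoted above):
# 4-node all-flash cluster, ~60 GB write log per node
nodes=4; wl_per_node=60; written=200
echo "data written (${written} GB) vs. total write log ($((nodes * wl_per_node)) GB)"
# 200 GB < 240 GB, so the persistent-tier capacity counters have not been updated yet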
Note: Usually, in a steady state, the flushes will happen periodically and don’t create a problem. Issues
are mostly observed during POCs, where very small amounts of data are written.
Best practices
VM periodic cleanup of space
As a best practice, weekly guest-level cleanup is suggested by setting up a CRON job (a sketch follows at the end of this subsection). Why? The guest VM runs an ext4 or NTFS file system, and the VM Disk (VMDK) presented from the HX Data Platform appears to the VM as an ordinary device/disk. When files are deleted at the application level, the used space from the virtual machine’s point of view drops to “0”, but the underlying drive is never told which blocks are now free. Because of this, no call ever reaches HX to indicate that the deleted space has been freed. So, from the HX standpoint the usage still shows the same value as before, while from the application standpoint the used space is “0”. This is a well-known problem in the virtualization space. Fortunately, there is a way to overcome this issue.
Because HX performs inline deduplication and compression, you can use the dd command with /dev/zero inside the guest file system to write zeros into a very large file. This forces the guest file system (ext4 or NTFS) to allocate the free blocks that only it knows about and fill them with zeros. When the application writes those zeros, the writes land on the formerly used blocks and are sent down to HX. Since the data is all zeros, it deduplicates and compresses down to almost nothing, so the cluster’s actual physical used space (the used space after deduplication and compression savings) drops, even though the effective (logical) used space does not change. The net result has the same effect as deleting the space on your system.
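A minimal sketch of such a scheduled weekly cleanup on a Linux guest follows; the schedule, file name, and zero-fill size are illustrative assumptions, and the zero file should be sized below the free space actually available so the guest file system does not fill up:
# Hypothetical /etc/cron.d entry: every Sunday at 02:00, write a ~10 GB file of zeros into free space, then delete it
0 2 * * 0  root  dd if=/dev/zero of=/zerofill bs=1M count=10240 && rm -f /zerofill
Windows guests can achieve the same effect by scheduling the SDelete procedure described in the next subsection.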
Reclaim space within a VM on a HyperFlex cluster
HyperFlex does not have a mechanism for automatically reclaiming space that is freed up within a thinly provisioned VM. Thin provisioning allows for on-demand growth of a virtual disk. This allows for more efficient use of space from the storage system compared to thick provisioning. When files on a thinly provisioned virtual drive are deleted, the
disk size is not automatically reduced. The reason for this is that the operating system only deletes the indexes
from the file system table. This results in the appearance of the quick deletion of files but no reclamation of
space. Reclaiming disk space only works when the blocks on your virtual disk are empty. Deleting data usually
only removes the entries from the file allocation table but it does not zero out the blocks. As a result, ESX will
still think the blocks are in use.
● Reclaim space within a Windows VM on a HyperFlex cluster
The following resource provides some options for reclaiming space on thinly provisioned virtual drives. Use the tools outlined in this section to zero the blocks of data you want deleted. Before performing any operation that will zero out blocks, ensure you create backup files. Also, make sure there are no existing snapshots on the virtual machine on which you perform the steps below. This includes SENTINEL snapshots. Removal of the SENTINEL snapshot will stun the VM for a prolonged period.
Reclaiming Space on Thinly Provisioned Disks on Windows VMs.
SDelete, a free utility from Microsoft's Sysinternals Suite, will be used for reclaiming space.
Download link: https://docs.microsoft.com/en-us/sysinternals/downloads/sdelete
VM VMDK size before running SDelete:
1. Download the file to the target VM.
2. Once you have downloaded SDelete to the target VM, unzip the file.
3. Open a command prompt as an administrator.
4. Navigate to the directory where you uncompressed the SDelete file.
5. Run the following command: sdelete -z <drive letter>
6. Select “agree”.
VM VMDK size after the SDelete operation:
Reclaim space within a Linux VM on a HyperFlex cluster:
You can use the dd utility to zero out the areas where the deleted data resides.
1. Use df -h to check the free space on the Linux OS.
2. Use the following command to zero out the empty space:
dd bs=1M count=90112 if=/dev/zero of=zero
3. In the preceding command, we are zeroing out about 90 GB of space. After this completes, the VMDK file will grow.
Size of VMDK in vCenter:
To reclaim the space, enter the following command:
$ sudo rm zero
[sudo] password for user: <--- Enter the password for the user
VM VMDK size in vCenter after deletion of the zero file:
Space utilization thresholds
A best practice is to keep storage cluster space utilization, as shown in the HX Connect capacity view, below 76 percent. Usage beyond 76 percent in the HX Connect capacity view results in:
● Performance degradation
● Poor resiliency status. Resiliency status is the ability of the storage clusters to tolerate node failures
within the storage cluster. It describes how well the storage cluster can handle disruptions.
If you experience any drive/node failures (cluster resiliency degrades), maintaining the 70-percent threshold will
help the cluster resiliency come back to a normal state by recreating the copies (stored on the failed drive) onto
the other drives. It can only do this refill/recreation if there is free space left in the cluster. Maintaining capacity
usage at 70 percent provides better performance, availability, rebalancing benefits, and future headroom for
growth.
Note: The automatic healing time for any drive failure is one minute and the wait time for any node failure
is two hours.
Cluster resiliency views at different cluster space utilizations
● Normal space usage
If the cluster space utilization is less than 76 percent at the HX Connect capacity view, then the storage
cluster space usage is normal. The screen capture below provides a snip when cluster storage utilization
is normal.
Capacity view of the resiliency normal health status at 76 percent in HX Connect:
● Warn state
If the cluster space utilization is above 76 percent at the HX Connect capacity view, then the storage
cluster space utilization is above the warn threshold. The screen capture below shows what you can
expect to see when the resiliency health status is at 80 percent in the HX Connect capacity view.
Capacity view of the resiliency normal health status at 80 percent in HX Connect:
HX Connect Alarms view when storage capacity is above the warn state (higher than 76%):
Intersight Alarms view when storage capacity is above the warn state (higher than 76%):
vCenter Alarms view when storage capacity is above the warn state (higher than 76%):
● Read-only state
A cluster above 99 percent of capacity usage will revert to a read-only operational state. In this state,
the cluster doesn’t accept any further writes until you free up some space. Being in this phase leads to
poor resiliency. If there are any disk/node failures, there is no additional space to recreate the copies on
other nodes.
Capacity view of the resiliency health status at 99 percent or greater in HX Connect:
Intersight Alarms view when a cluster is in a “read only state” (higher than 99% capacity):
vCenter Alarms view when a cluster is in a “read-only state” (higher than 99% capacity):
HX Connect System overview view when a cluster is in a “read-only state” (higher than 99% capacity):
Once the cluster’s space usage drops back to 92 percent, the cluster returns to an operational state. The operational status changes from read-only mode to online, but the resiliency status will still be in a warning state, as shown below.
HX Connect dashboard view when a cluster returns to an operational state but is still in a warning state:
How to handle misbehaving drives
A failure of a capacity disk on a node causes the disk to go offline and be marked as blacklisted. Multiple disks in the same node can fail without impact to the node or the cluster with the appropriate RF setting. A rebalance after a disk failure can take several minutes up to two hours, depending on the failed disk’s size and the amount of space consumed on it. Small additional latencies may be observable during this time. After one or more disk failures, the system will start the rebalancing process. After the rebalance, the system is marked as healthy. Once the system is healthy again, the failing disk(s) can be replaced. The specific process for handling a misbehaving disk is to blacklist it. If a disk fails, the software automatically tries removing and re-adding the disk up to five times; if it still fails, the disk is hard blacklisted, meaning the cluster stops reselecting the disk and considers the drive permanently gone. Later, users can always opt to replace the blacklisted drive with a new drive.
System information page after hard blacklisting drive:
vCenter Alarms view when the disk is blacklisted:
Space accounting for snapshots
A snapshot captures the entire state of the virtual machine at the time you take the snapshot. This includes:
● Memory state - the contents of the virtual machine’s memory
● Settings state - the virtual machine settings
● Disk state - the state of all the virtual machine’s disks
Snapshots operate on individual virtual machines. A snapshot preserves only the active virtual machine’s state.
When you revert to a snapshot, you return the virtual machine’s memory, settings, and virtual disks to the state
they were in when you took the snapshot. Storage can be an issue when using VMware snapshots. The size of
a snapshot file can never exceed the size of the original disk file. Any time you change a disk block, the system writes the change to the snapshot’s delta file and updates that file as further changes are made. If you change every single disk block on your server after taking a snapshot, your snapshot would still be the same size as your original disk file.
There is some additional overhead disk space that contains information that can be used to manage the
snapshots. The maximum overhead disk space varies. Traditional overhead estimates are based on the Virtual
Machine File System (VMFS) block size using VMFS-3.
Besides the performance impact, there is also the potential for vCenter to not be aware of snapshots due to an
out-of-sync condition, also called an inconsistent VM snapshot or broken VM snapshot. In this case, it is
possible that a virtual machine, which has snapshots and performs many disk writes, could have a snapshot that
fills the datastore and causes all virtual machines on that datastore to go down until space is freed. So, you can
end up with a massive VM snapshot size eating up your production space.
Any data that was writable on a VM becomes read-only when the snapshot is taken. VMware administrators
can take multiple snapshots of a VM to create multiple possible point-in-time restore points. When a VM reverts
to a snapshot, current disk and memory states are deleted and the snapshot becomes the new parent snapshot
for that VM. The snapshot file cannot exceed the size of the original disk file, and it requires some overhead
disk space. Snapshots will grow rapidly with high disk-write activity volume. VMware recommends deleting
snapshots within 24 – 72 hours.
Snapshot file formats include the *-delta.vmdk file, *.vmsd file, and *.vmsn file. Administrators create snapshots in
VMware vSphere's Snapshot Manager or with the vmware-cmd command line utility. Deleting, or committing,
snapshots merges all the delta files into the VMDK. If delta files remain in the VM's directory after deletion, the
snapshot did not delete properly.
Note: HyperFlex Data Platform Release 4.5.x requires changes in the snapshot procedure. The content provided in this document supports the HyperFlex Data Platform 4.0.2 release.
Check snapshot usage
For example, let’s take a VM titled, “test123highHxBenchVm1”. It has two hard disks and we’ve created a
snapshot called snap1 for the VM as shown below.
Snapshot example:
In the following image, the highlighted disks in yellow are the delta VMDKs created by the snapshot (snap1) for
the VMDKs of the original VM, “test123highHxBenchVm1”. The highlighted file in blue is the snapshot overhead
file.
Snapshot space usage from vCenter datastore view:
The snapshot space usage from the ESXi host view is provided in the screen capture below. (It provides
additional stats such as physical shared and unshared bytes).
Snapshot space usage from the ESXi host view:
vmkfstools –extendedstatinfo <vm_name/snapshot_vmdk_name>
Space usage with multiple snapshots of a VM
When you create a snapshot, the state of the virtual disk at the time the snapshot is taken is preserved and
writes to the VMDK file are not allowed. The system creates an additional VMDK file called a delta disk for
each VMDK disk in the VM and allows you to write any changes to that delta disk. The delta disk represents the
difference between the current state of the virtual disk and the state that existed at the time the previous
snapshot was taken.
If multiple snapshots are taken, delta disks will be created for each of VM disks of every snapshot and can
represent the difference between each snapshot.
When you delete a snapshot, changes between snapshots and the previous disk states are merged and all the
data from the delta disk that contains the information about the deleted snapshot is written to the parent VMDK
disk. The amount of time it takes to commit or delete snapshots depends on how much data the guest
operating system has written to the virtual disks since the last snapshot was taken.
For example, let’s take the same VM we used for explaining the snapshot view and create another snapshot,
called snap2, as shown in the first screen capture below. The next two screen captures show the multiple
snapshots’ space usage within a VMware vCenter datastore and VMware ESXi, respectively.
snap2:
Multiple snapshots space usage from the vCenter datastore view:
Snapshot space usage from the ESXi host view:
Native snapshots
A native snapshot is a reproduction of a VM that includes the state of the data on all VM disks and the VM
power state (on, off, or suspended) at the time the native snapshot is taken. Take a native snapshot to save the
current state of the VM, so that you can revert to the saved state. Always use the HX Data Platform Snapshot
feature to create your first snapshot of a VM. This ensures that all subsequent snapshots are in native format.
When you create the first snapshot of a VM (through either Snapshot Now or a Scheduled Snapshot), the HX Data Platform plug-in creates a base snapshot called a SENTINEL snapshot. The SENTINEL snapshot ensures that follow-on snapshots are all native snapshots. The SENTINEL snapshot consumes one of the snapshots available per VM under the VMware limitation.
HX Data Platform native snapshot options include creating a native snapshot, reverting to any native snapshot, and deleting a native snapshot.
1. Create a native snapshot
Consider a test VM named "Core_perf_vmHxBenchVm12" in the vCenter that has two hard disks. The available
disks and their capacity usage before taking a snapshot are shown in the following snip:
Create native snapshot using stcli commands, as shown in this next snip:
vCenter view after creating a snapshot:
The following information shows the two snapshots that were created after initiating the snapshot creation
command.
Snapshot1 – SENTINEL
Snapshot2 – snap1234
The following information shows the space usage details of the base disks after snapshot creation.
Note: Base disks get renamed after snapshot creation.
SENTINEL Snapshot
SENTINEL snapshots prevent reverted VMs from having VMware redo log-based virtual disks. Redo log-based
virtual disks occur when an original snapshot is deleted and the VM is reverted to the second oldest snapshot.
SENTINEL snapshots are in addition to the revertible native snapshot.
Note: SENTINEL snapshot creation isn’t supported with the 4.5.x Release of HX Data Platform when used
with specific ESXi versions.
When using SENTINEL snapshots:
● Do not delete the SENTINEL snapshot
● Do not revert your VM to the SENTINEL snapshot
Space usage of SENTINEL snapshot:
vCenter view of SENTINEL snapshot:
Space usage view of the HX native snapshot “snap1234”:
vCenter view of HX native snapshot “snap1234”:
2. Revert a snapshot
Reverting a snapshot is returning a VM to a state stored in a snapshot. Reverting a snapshot is performed on
one VM at a time. This is not performed at the resource pool or VM folder level. Reversion of snapshots is
performed through the vCenter Snapshot Manager and not through the HX Data Platform plug-in.
3. Delete a snapshot
Deleting snapshots is managed through the vSphere interface and not through the HX Data Platform plug-in.
Differences between a VM snapshot and HX native snapshot:
● Snapshots created through VMware vSphere are non-native snapshots, while snapshots created from the HX Data Platform plug-in are native snapshots.
● Currently, VMware has a limitation of 32 snapshots per VM. This maximum total includes VMware-created snapshots, an HX Data Platform SENTINEL snapshot, and HX Data Platform native snapshots.
● Native snapshots have a limitation of 31 snapshots per VM since the first native snapshot created for a
VM creates two snapshots, including a SENTINEL snapshot.
● The HX Data Platform native snapshot process is fast because it does not incur I/O overhead.
● HX Data Platform native snapshots do not degrade VM performance.
● HX Data Platform native snapshots are crash-consistent by default (I/O crash).
● Consistency is defined as maintaining the correct order of write operations to enable an application to
restart properly from a crash.
The Cisco HyperFlex HX Data Platform uses metadata-based, zero-copy snapshots to facilitate backup
operations and remote replication—critical capabilities in enterprises that require always-on data availability.
Space-efficient snapshots allow you to perform frequent online backups of data without needing to worry about
the consumption of physical storage capacity. Data can be moved offline or restored from these snapshots
instantaneously.
● Fast snapshot updates – When modified data is contained in a snapshot, it is written to a new location
and the metadata is updated, without the need for read-modify-write operations.
● Rapid snapshot deletions – You can quickly delete snapshots. The platform simply deletes a small
amount of metadata that is located on an SSD, rather than performing a long consolidation process as
needed by solutions that use a delta-disk technique.
● Highly specific snapshots – With the Cisco HyperFlex Data Platform, offloaded snapshots occur at a VM
level of granularity. You are not forced to take a snapshot of the entire volume or datastore to capture a
single VM.
Clones
A clone is a copy of an existing virtual machine. The existing virtual machine is called the parent of the clone.
When the cloning operation is complete, the clone is a separate virtual machine, although it may share virtual
disks with the parent virtual machine.
● Changes made to a clone do not affect the parent virtual machine. Changes made to the parent virtual
machine do not appear in a clone.
● A clone's MAC address and Universally Unique Identifier (UUID) are different from those of the parent
virtual machine.
If you want to save the current state of the virtual machine so you can revert to that state in case you make a
mistake, take a snapshot. If you want to make a copy of a virtual machine for separate use, create a clone.
Full and linked clones
There are two types of clones:
● A full clone is an independent copy of a virtual machine that shares nothing with the parent virtual
machine after the cloning operation. Ongoing operation of a full clone is entirely separate from the parent
virtual machine.
● A linked clone is a copy of a virtual machine that shares virtual disks with the parent virtual machine in an
ongoing manner. This conserves disk space and allows multiple virtual machines to use the same
software installation.
HyperFlex clones
In the Cisco HyperFlex HX Data Platform, clones are writable snapshots that can be used to rapidly provision
items such as virtual desktops and applications for test and development environments. These fast, space-
efficient clones rapidly replicate storage volumes so that virtual machines can be replicated through just
metadata operations, with actual data copying performed only for write operations. With this approach,
hundreds of clones can be created and deleted in minutes. Compared to full-copy methods, this approach can
save a significant amount of time, increase IT agility, and improve IT productivity.
Clones are deduplicated when they are created. When clones start diverging from one another, data that is
common between them is shared, with only unique data occupying new storage space. The deduplication
engine eliminates data duplicates in the diverged clones to further reduce the clone’s storage footprint. As a
result, you can deploy many application environments without needing to worry about storage capacity use.
If you use the VMware cloning operation, you can create only a single clone from a VM. This operation is
manual and slower than batch processing multiple clones from a VM. For example, to create 20 clones of a VM,
you must manually perform the clone operation repeatedly. Use HX Data Platform Ready Clones to create
multiple clones of a VM in one click.
When a native clone of a VM is made, the HX Data Platform simply copies the set of pointers. It takes no extra
disk space (apart from the size of the set of pointers) and is inherently de-duplicated.
Clone a VM using stcli:
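The original screen capture is not reproduced here. As a rough, hypothetical sketch of the Ready Clones CLI (run from the storage controller VM; flag names can differ between releases, so verify against the stcli command reference):
# Create 5 ready clones of the VM "SourceVM" with the name prefix "Clone" (illustrative values)
stcli vm clone --name SourceVM --clone Clone --number 5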
Space usage of clone VM:
VMware vStorage API for Array Integration (VAAI)
VAAI, also called hardware acceleration or hardware offload APIs, is a set of APIs that enable communication
between VMware vSphere ESXi hosts and storage devices. This feature enables your host to offload specific
virtual machine and storage management operations to compliant storage hardware. With the storage hardware
assistance, your host performs these operations faster and consumes less CPU, memory and storage fabric
bandwidth.
This storage offload API allows vSphere to request advanced file system operations such as snapshots and
cloning. The controller causes these operations to occur through manipulation of metadata rather than actual
data copying, providing rapid response, and thus rapid deployment of new application environments.
Without the use of VAAI, cloning or migration of virtual machines by the vSphere VM Kernel Data Mover involves
software data movement. The Data Mover issues I/O to read and write blocks to and from the source and
destination datastores. With VAAI, the Data Mover can use the API primitives to offload operations to the array,
if possible. For example, if the desired operation were to copy a Virtual Machine Disk (VMDK) file from one
datastore to another inside the same array, the array would be directed to make the copy completely inside the
array. Whenever a data movement operation is invoked and the corresponding hardware offload operation is
enabled, the Data Mover will first attempt to use the hardware offload. If the hardware offload operation fails,
the Data Mover reverts to the traditional software method of data movement.
VAAI uses these fundamental operations:
● Atomic Test and Set (ATS), which is used during creation and locking of files on the VMFS volume
● Clone Blocks/Full Copy/XCOPY, which is used to copy or migrate data within the same physical array
● Zero Blocks/Write Same, which is used to zero out disk regions
● Thin-provisioning in ESXi 5.x and later hosts, which allows the ESXi host to tell the array when the space
previously occupied by a virtual machine (whether it is deleted or migrated to another datastore) can be
reclaimed on thin provisioned LUNs
● Block Delete in ESXi 5.x and later hosts, which allows for space to be reclaimed using the SCSI UNMAP
feature
Command to check VAAI status:
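The screen capture for this command is not reproduced here. On an ESXi host, one commonly used check for block devices is shown below (a sketch; note that HyperFlex datastores are presented over NFS, so also see the NFS check that follows):
# Lists VAAI primitive support (ATS, Clone, Zero, Delete) for each block storage device
esxcli storage core device vaai status get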
Command for NFS datastores to check whether VAAI is supported:
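A sketch of the equivalent check for NFS datastores, which is how HX datastores are mounted on ESXi; the Hardware Acceleration column in the output reports VAAI support:
# Lists NFS datastores mounted on the host, including their hardware acceleration status
esxcli storage nfs list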
vSphere view of VAAI status:
For each storage device and datastore, the vSphere client displays the hardware acceleration support status in
the Hardware Acceleration column of the Devices view and the Datastores view.
The status values are “Unknown”, “Supported”, and “Not Supported”. The initial value is “Unknown”. The status
changes to “Supported” after the host successfully performs the offload basic operations. If the offload
operation fails, the status changes to “Not Supported”.
ESXI command-line interface view of VAAI status:
VAAI limitations
VAAI cannot be used when:
● The source and destination VMFS volumes have different block sizes
● The source file type is RDM and the destination file type is non-RDM (regular file)
● The source VMDK type is Thick Eager Zeroed and the destination VMDK type is thin
● The source or destination VMDK is any kind of sparse or hosted format
● Cloning a virtual machine that has snapshots, because this process involves consolidating the snapshots
into the virtual disks of the target virtual machine
● The logical address and/or transfer length in the requested operation is not aligned to the minimum
alignment required by the storage device (all datastores created with the vSphere Client are aligned
automatically)
● The VMFS datastore has multiple LUNs/extents spread across different arrays
Different cases of failure modes
Case 1: Node failure
The behavior of a Cisco HyperFlex cluster during a node failure depends on three factors:
● The overall HyperFlex cluster’s size (number of nodes) during healthy operations
● The Replication Factor (RF) that has been configured for the cluster
● The number of (additional) nodes that have failed simultaneously
When a node failure occurs, workloads will see a noticeable latency spike across the cluster. The surviving nodes may be computationally impacted as the VMs from the failed node are restarted. Capacity utilization in general will rise by roughly 1/n per node once the rebuild completes (n = total number of original nodes). After a certain time, depending on the available
nodes). After a certain time, depending on the available storage resources, the cluster will start rebalancing the
RF2 / RF3 protection. This time is called the healing time. By default, the healing time is set to two hours. If
within this time the nodes come back online, the cluster will start updating the storage layer, depending on the
number of updates written to the other copies of the data. For more information, read the following full
explanation of the different scenarios.
A storage cluster healing timeout is the length of time the HX Connect or HX Data Platform plug-in waits before
automatically healing the storage cluster. If a disk fails, the healing timeout is one minute. If a node fails, the
healing timeout is two hours. A node failure timeout takes priority if a disk and a node fail at same time or if a
disk fails after node failure, but before the healing is finished.
When the cluster resiliency status is “Warning”, the HX Data Platform system supports the storage cluster
failures and responses for RF3 outlined in the bulleted list below. Optionally, click the associated Cluster
Status/Operational Status or Resiliency Status/Resiliency Health in HX Connect and HX Data Platform plug-in to
display the reason messages that explain what is contributing to the current state.
● Three-node cluster with one simultaneous failure – The storage cluster does not automatically heal.
However, there is a Cisco Technical Assistance Center (TAC) procedure to replace the failed node to
restore storage cluster health.
● Four-node cluster with one simultaneous failure – If the node does not recover in two hours, the
storage cluster starts healing by rebalancing data on the remaining nodes. To recover the failed node
immediately and fully restore the storage cluster:
◦ Check that the node is powered on and restart if possible. You might need to replace the node.
◦ Rebalance the cluster.
● A cluster of five or more nodes with two simultaneous failures – If the node does not recover in two
hours, the storage cluster starts healing by rebalancing data on the remaining nodes. To recover the
failed node immediately and fully restore the storage cluster:
◦ Check that the node is powered on and restart if possible. You might need to replace the node.
◦ Rebalance the cluster.
Various impacts during a node failure:
● Potential impacts on the VM (all components) located on the failed node – Depending on the cluster configuration’s VMware settings for high availability and DRS, the VMs that were running on the failing node need to be, or are, automatically restarted on another node. In a normal production environment, the cluster has an N+1 configuration. This means that there is one node with additional resources (spare) in case one node fails. Data is still available since there is no data locality, so Storage vMotion is not needed. This is due to the usage of RF2 or RF3 (the replication factor of the data platform).
● Potential impacts on the VM (all components) located on a different node – The VMs that were
running on the failing node can be restarted on the remaining nodes; however this depends on the
requirements settings of high availability and DRS. It also depends on how many resources are available
on the remaining running nodes. At the time of design, it must be considered how many resources are
available for failover in case of failures. In a normal production environment, the cluster has an N+1
configuration. This means that there is one node with additional resources (spare) in case one node fails.
● System stabilization – After the failing nodes have been replaced/repaired, the system is rebalanced
back to its original state. This means that the VMs can be transferred online to the previously failing
nodes without interruption of the operation. The data will be rebalanced in the background. The time this
will take is dependent on the size of the underlaying storage layer, the number of updates of the data,
the CPU type, and memory size.
HX Connect dashboard page when there is a node failure:
System information page after a node failure:
Case 2: Cache disk failure
After a cache SSD fails, the cluster remains highly available. The VMs keep running because the HyperFlex storage controller VM is still running and continues to move data to and from the HX Data Platform in conjunction with the other nodes and the HyperFlex storage controller VMs residing on those nodes. Data will always be accessible.
Various impacts during a cache disk failure:
● Potential impacts on the VM (all components) hosted on the same node where the cache disk failed – For hybrid clusters, expect a slight latency impact on that specific node because the aggregate read cache is diminished until the SSD is replaced and the cluster rebalances. There is no read cache in an all-flash cluster, so there will be no impact to latency. The write cache always remains online because the HyperFlex storage controller VM will ‘talk’ to the other HyperFlex storage controller VMs residing on the other nodes.
● Potential impacts on the VM (all components) hosted on a different node where cache disk failed:
No impact
● System stabilization – After replacing the cache disk, the cluster will rebalance the data.
Case 3: Capacity disk failure
A failure of a capacity disk on a node will cause the disk to go offline and be marked as blacklisted. Multiple
disks in the same node can fail without impact to the node and cluster with the appropriate RF setting:
● A cluster of three or more nodes with one simultaneous disk failure – If the disk does not recover in
one minute, the storage cluster starts healing by rebalancing data on the remaining nodes.
● A cluster of three or more nodes with two simultaneous disk failures – If the disk does not recover in
one minute, the storage cluster starts healing by rebalancing data. If the system is not restored, replace
the faulty disks and restore the system by rebalancing the cluster.
● A cluster of four or more nodes with two simultaneous disk failures – If the disk does not recover in
one minute, the storage cluster starts healing by rebalancing data on the remaining nodes (where the
disk failure didn’t happen).
● A cluster of five or more nodes with two simultaneous disk failures – The system automatically
triggers a rebalance after a minute to restore storage cluster health.
Note: For disk failures, with RF3, you can lose as many disks as you want in each node, as long as you do
not lose disks in three separate nodes at the same time. So, you can lose five disks in node 1 and 8 disks
in node 2 and still stay online. This assumes the nodes are still available and you are experiencing only disk
losses. The cluster will quickly heal, such that you can continue to sustain losses indefinitely (if free cluster
space remains).
Various impacts during capacity drive failure:
● Potential impacts on the VM (all components) hosted on the same node where capacity disk failed –
A rebalance after a disk failure can take several minutes to two hours, depending on the failed disk size
and the amount of space consumed on it. Small additional latencies may be observable during this time.
● Potential impacts on the VM (all components) hosted on a different node where capacity disk failed
– A slight increase of latency could be measured, although if it can be observed, the impact is extremely
low.
● System stabilization – After one or more disk failures, the system will start the rebalancing process.
After the rebalance, the system is marked as healthy. After the system is healthy again, the failing disk(s)
can be replaced.
System information page before disk failure:
System information page after disk failure:
From the previous snip, there are only five capacity disks shown out of a total of six drives.
Space usage’s effect on performance
Following are the performance benchmarking results at different cluster capacities (all-Flash cluster with RF set
to 3).
Performance metrics observed when cluster capacity is at 70 percent (before the cluster enters a Warn state)
Application-level metrics:
Load Point           10%      50%      75%      90%      100%
IOPS                 7,202    35,604   53,407   63,999   69,554
Latency (ms)         1.3      2.8      5.3      6.9      7.7
Throughput (MBps)    56.3     278.2    417.2    500.0    543.4
HX Connect performance dashboard:
Performance metrics observed when cluster capacity is at 80 percent (above the Warn state threshold of 76 percent)
Application-level metrics:
Load Point           10%      50%      75%      90%      100%
IOPS                 6,201    30,902   46,304   55,492   55,145
Latency (ms)         2.7      4.1      7.5      9.2      10.1
Throughput (MBps)    48.4     241.4    361.7    433.5    430.8
HX Connect performance dashboard:
These test results show a clear drop in performance as the cluster fills up. The real degradation begins
once the cluster enters the Warn state, as can be seen by comparing the results at the two capacity
levels.
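The size of the drop can be quantified directly from the two tables above. The short calculation below compares the 100 percent load points of the 70 percent-full and 80 percent-full runs; it is only arithmetic on the measured values already shown.

```python
# Relative change between the 70 percent-full and 80 percent-full runs
# at the 100 percent load point, using the values from the tables above.
at_70 = {"IOPS": 69554, "Latency (ms)": 7.7, "Throughput (MBps)": 543.4}
at_80 = {"IOPS": 55145, "Latency (ms)": 10.1, "Throughput (MBps)": 430.8}

for metric in at_70:
    change = (at_80[metric] - at_70[metric]) / at_70[metric] * 100
    print(f"{metric}: {change:+.1f}%")

# Output (rounded): IOPS -20.7%, Latency (ms) +31.2%, Throughput (MBps) -20.7%
```

In other words, crossing the Warn threshold in this test cost roughly a fifth of the IOPS and throughput and added roughly a third more latency at the highest load point.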
Note: In the following graph, the load lines moving toward the top left corner are typical of any storage
system. This is not just a HyperFlex phenomenon.
Handling “Out of Space” errors
If the system displays an “Out of Space” error, such as the cluster is entering “Read-only mode”, you can either
add a node to increase free capacity or delete existing unused VMs to release space. When there is an “Out of
Space” condition, the VMs are unresponsive.
Note: Do not delete storage controller VMs. Storage controller VM names have the prefix stCtlVM.
Following are the steps to take to handle “Out of Space” issues (a scripted sketch of the VM cleanup steps follows this list):
● To add a node, use the Expand Cluster feature of the HX Data Platform Installer.
● To delete unused VMs:
◦ Determine which guest VMs you can delete. You can consider factors such as disk space used by the
VM or naming conventions.
◦ Go to vCenter > Virtual Machines to display the virtual machines in the inventory.
◦ Double-click a VM that you want to delete.
◦ Power off the VM.
◦ Delete the VM.
● After the “Out of Space” condition is cleared:
◦ Go to vCenter > Virtual Machines to display the VM in the inventory.
◦ Double-click a VM that you want to use.
◦ Power on the VM.
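For environments where this cleanup needs to be scripted, the steps above can be automated against vCenter. The following is a minimal sketch using the pyVmomi library; the vCenter hostname, credentials, and VM name are hypothetical placeholders, and the check that skips names beginning with stCtlVM reflects the note above about never deleting storage controller VMs.

```python
# Minimal pyVmomi sketch of the manual steps above: locate an unused guest VM,
# power it off, and delete it. All connection details below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

def delete_unused_vm(vc_host, user, pwd, vm_name):
    ctx = ssl._create_unverified_context()                 # lab use only
    si = SmartConnect(host=vc_host, user=user, pwd=pwd, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            if vm.name != vm_name:
                continue
            if vm.name.startswith("stCtlVM"):              # never delete controller VMs
                raise RuntimeError("refusing to delete a storage controller VM")
            if vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn:
                WaitForTask(vm.PowerOffVM_Task())          # power off the VM
            WaitForTask(vm.Destroy_Task())                 # delete the VM from disk
            return True
        return False
    finally:
        Disconnect(si)

# Example (placeholder values):
# delete_unused_vm("vcenter.example.com", "administrator@vsphere.local",
#                  "VMware1!", "unused-test-vm")
```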
Cluster expansion
When there is a need for additional storage in a cluster, the HX Data Platform supports an expansion feature to
increase storage capacity. You can add disks to the existing nodes or add new converged nodes to the existing
cluster.
Guidelines for expansion
● Replication should be paused before performing a cluster expansion; it can be resumed after completing
expansion.
● Use only admin credentials for the controller VM during the expansion workflow. Using credentials other
than admin may cause the expansion to fail.
● The ESXi installation is supported on SD cards for M4 converged nodes and on M.2 SATA SSDs for M5
converged nodes. For compute-only nodes, ESXi installation is supported on SD cards, SAN boot, or a
front SSD/HDD. Installing ESXi on a USB flash drive is not supported for compute-only nodes.
Mixed cluster expansion guidelines
● Expanding an existing M4 cluster with M5 converged nodes is supported.
● Expanding an existing M5 cluster with M4 converged nodes is not supported.
● Expanding an existing mixed M4/M5 cluster with M4 or M5 converged nodes is supported.
● Adding any supported compute-only nodes is permitted with all M4, M5, and mixed M4/M5 clusters
using the HX Data Platform Release 2.6 or later Installer. Some example combinations are listed here;
many other combinations are possible.
Example combinations:
◦ Expand a mixed M4/M5 cluster with compute-only Cisco UCS® B200, C220, C240 M4/M5 servers
◦ Expand an M4 cluster with compute-only Cisco UCS B200 M5, C220 M5, C240 M5 servers
● Only an expansion workflow is supported to create a mixed cluster. An initial cluster creation with mixed
M4/M5 servers is not supported.
● All M5 servers must match the form factor (220/240), type (Hybrid/AF), security capability (Non-SED
only), and disk configuration (quantity, capacity, and non-SED) of the existing M4 servers.
● HyperFlex HX220c M5 nodes will use a maximum of six capacity disks (two disk slots will remain empty)
when mixed with HX220c M4 nodes.
● Mixed M4/M5 clusters are not supported for HX Edge, SED, LFF, Hyper-V, or stretched cluster deployments.
Note: Adding a compute-only node does not change the available cluster capacity.
Cluster node expansion
A converged node can be added to a HyperFlex cluster after cluster creation. The storage on a converged node
is automatically added to the cluster’s storage capacity. Multiple nodes can be added without waiting for
rebalancing, and the new storage capacity is available for use immediately after a node is added. Rebalancing
may not start immediately after the node is added; when it starts depends on the cluster’s storage capacity
usage.
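As a back-of-the-envelope illustration of why a converged node (and not a compute-only node) grows capacity, the sketch below multiplies drives per node by drive size and divides by the replication factor to approximate usable capacity before deduplication and compression. The node count, drive count, drive size, and RF are hypothetical, and real usable capacity also depends on metadata and filesystem overhead.

```python
# Rough capacity estimate: raw capacity and an RF-adjusted usable figure,
# before deduplication, compression, and metadata overhead. Values are hypothetical.
def cluster_capacity_tib(converged_nodes, drives_per_node, drive_tib, rf):
    raw = converged_nodes * drives_per_node * drive_tib
    return raw, raw / rf

before_raw, before_usable = cluster_capacity_tib(3, 6, 3.8, rf=3)
after_raw, after_usable = cluster_capacity_tib(4, 6, 3.8, rf=3)
print(f"before expansion: raw {before_raw:.1f} TiB, ~{before_usable:.1f} TiB after RF")
print(f"after expansion:  raw {after_raw:.1f} TiB, ~{after_usable:.1f} TiB after RF")
# Adding a compute-only node would leave both figures unchanged.
```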
Scenario: adding new nodes (cluster node expansion)
● The storage cluster should be healthy.
● Check the node interoperability rules (e.g., an M4 cluster can be expanded with M5 nodes, but an M5 cluster cannot be expanded with M4 nodes).
● Note: Some features, such as a stretched cluster, are not supported on M4 nodes.
● Clusters need to be homogenous:
◦ All nodes need to have the same number of capacity drives. (An M5 node has additional drive slots; in a
mixed cluster you will not be able to take advantage of them.)
◦ All data drives need to be the same capacity.
◦ All data drives need to be the same type (for instance, you cannot mix SED and non-SED or NVMe
and SATA).
◦ The M4 and M5 nodes have different drive sleds; they will not fit into other nodes.
◦ Cache drives must be compatible.
◦ Pay attention to the cluster version when using alternate drives. (More explanation on this topic will
be covered later in this guide.)
● Ensure that the new node uses the same configuration as the other nodes in the storage cluster. This
includes VLAN IDs and switch types (whether vSwitches), VLAN tagging with External Switch VLAN
Tagging (EST), VLAN tagging with Virtual Switch Tagging (VST), or a virtual distributed switch.
(If the storage cluster is in an “Out of Space” condition, then when you add a new node, the system will
automatically rebalance the storage cluster. This is in addition to the rebalancing that is performed every
24 hours.)
● Ensure that the node you add is of the same model (HX220 or HX240), type (hybrid or all-Flash), and
disk configuration (SED or non-SED). In addition, ensure that the number of capacity disks matches the
existing cluster nodes.
● To add a node that has a different CPU family from what is already in use in the HyperFlex cluster,
enable Enhanced vMotion Compatibility (EVC).
● Ensure that the software version on the node matches the Cisco HX Data Platform version, the ESXi
version, and the vCenter version. To identify the software version, go to the Storage Cluster Summary
tab in vCenter and check the HX Data Platform version in the top section. Upgrade if necessary.
(If you upgraded the cluster, you must download and install a new installer VM that matches the current
version of HyperFlex Data Platform running on the cluster.)
● Ensure that the new node has at least one valid DNS and Network Time Protocol (NTP) server
configured.
● If you are using Single Sign-On (SSO) or auto-support, ensure that the node is configured for SSO and
Simple Mail Transfer Protocol (SMTP) services.
● Allow Internet Control Message Protocol (ICMP) for pinging between the HX Data Platform installer and
the existing cluster’s management IP address. (A basic reachability pre-check is sketched after this list.)
● You can add multiple nodes without waiting for a rebalance.
● New capacity is available immediately. (The rebalance may not kick off immediately and will take time
when it does. Additional read performance will be available post-rebalance.)
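Some of the prerequisites above (DNS resolution, NTP server reachability, and ICMP reachability of the cluster management IP) can be sanity-checked with a short script before starting the expansion. The sketch below is illustrative only; the hostnames and IP address are placeholders, and it relies on the local ping command rather than any HyperFlex API.

```python
# Basic pre-expansion reachability checks. Hostnames and the management IP
# below are placeholders; replace them with values from your environment.
import socket
import subprocess

def dns_resolves(name: str) -> bool:
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False

def icmp_ping(host: str) -> bool:
    # Uses the local "ping" command; "-c 1" sends a single echo request (Linux/macOS).
    return subprocess.run(["ping", "-c", "1", host],
                          stdout=subprocess.DEVNULL).returncode == 0

print("vCenter name resolves:           ", dns_resolves("vcenter.example.com"))
print("NTP server name resolves:        ", dns_resolves("ntp.example.com"))
print("Cluster management IP reachable: ", icmp_ping("192.168.10.20"))
```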
Summary
● Assign the Compute-Only service profile to a server.
● Install ESXi, downloaded from www.cisco.com/go/software.
● Make sure you can ping the vSphere host.
● Run the HyperFlex installer and choose the “Manual” option.
● Check “Deploy HX Software and Expand Server”.
● Fill in the requested credentials and add a node with the vSphere IP address of the host.
● Wait until the expansion is finished.
Test results for cluster node expansion
Cluster capacity before node expansion:
System information page before node expansion:
Cluster capacity after node expansion:
System information page after expansion:
System overview page after node expansion is complete:
During the rebalance operation:
Cluster capacity after the rebalance operation is complete:
Newly added node after rebalancing is complete:
Storage usage on the newly added node after rebalancing is complete:
Storage usage on other nodes after rebalancing is complete:
Performance snip during the complete node expansion experiment:
In the previous snip, the rebalance operation (highlighted with the black circles at the bottom of the image)
started around 5:15 p.m. and ended at 7:05 p.m.
Performance snip from start of the test (with three nodes) and a fourth node added to the cluster (before
the rebalance operation started):
Performance snip from the start of the rebalance operation through to its completion:
Comparing the performance results during the rebalance operation with the data captured before the rebalance
started (highlighted in the screen captures above) shows a slight performance impact on the cluster while the
rebalance runs after the additional node is added.
Performance snip after rebalancing is complete and end of the workload run:
Cluster data drive expansion
When a cluster is running out of space, another way to increase storage capacity is to add more data drives
to the existing nodes in the cluster. Data drives can be added dynamically to all nodes without waiting for a
rebalance. All data drives in the cluster need to be of the same capacity.
Scenario: Adding additional drives to existing clusters
● Add data drives dynamically – add drives to all nodes without waiting for a rebalance
● Clusters need to be homogeneous:
◦ All nodes need to have the same number of data drives
◦ All data drives need to be the same capacity
◦ All data drives need to be the same type (e.g., you cannot mix SED and non-SED or NVMe and SATA
drives)
◦ M4 and M5 nodes have a different drive sled; they don’t fit into other nodes
◦ Pay attention to the cluster version when using alternate drives (more explanation on this issue is
covered later in this document)
● Adding more drives is not applicable for cache drives (only one cache drive per node is supported)
● The additional capacity will be available immediately (rebalance may not kick off immediately and will
take time when it does)
Newly added drives in UX (4.0(2x) example)
HyperFlex storage capacity forecasting and trending
The Capacity Runway (preview) feature available in Cisco Intersight enables you to predict the storage
utilization for HyperFlex clusters. The Capacity Runway Details view displays the Storage Runway and Utilization
Trend. The details on this page enable you to monitor storage utilization and manage capacity planning. If
storage utilization exceeds 76 percent, add drives or nodes to expand capacity, and consider removing unused
virtual machines and deleting files. On the Capacity Runway Details page, you can view the following
information (a simple forecasting sketch follows this list):
● Storage runway – Displays the number of days predicted for the storage capacity to exceed the
recommended capacity of 76 percent in the next full year
● Storage capacity (TiB) – Displays the amount of storage capacity available in the cluster
● Storage utilization – Displays the amount of storage capacity utilized by the cluster as a percentage. The
storage utilization percentage is computed based on the total capacity and the current capacity
utilization
● Storage utilization trend – Displays the percentage increase or decrease in storage utilization in the last
six months
● Storage optimization – Provides a percentage representation of the efficient usage of the storage
infrastructure
● Utilization trend graph – Displays the current trend and forecast trend for the storage utilization. The
graph is color-coded to indicate the alarms raised for the thresholds. You can select utilization
thresholds of 76 percent or 100 percent to view the predicted storage utilization on the graph.
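The runway idea can be illustrated with a simple linear-trend estimate: fit a slope to recent daily utilization samples and project when the 76 percent recommended limit will be crossed. This is not Intersight’s actual forecasting algorithm, only a minimal sketch, and the daily samples used below are hypothetical.

```python
# Illustrative linear-trend "runway" estimate: days until storage utilization
# is projected to cross the 76 percent recommended limit. Not Intersight's algorithm.
def runway_days(daily_utilization_pct, threshold=76.0, horizon_days=365):
    n = len(daily_utilization_pct)
    if n < 14:                                   # mirrors the 14-day minimum noted below
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_utilization_pct) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_utilization_pct))
             / sum((x - mean_x) ** 2 for x in xs))    # percent per day
    current = daily_utilization_pct[-1]
    if current >= threshold:
        return 0
    if slope <= 0:
        return horizon_days                       # not expected to cross within the horizon
    return min(int((threshold - current) / slope), horizon_days)

# 30 days of hypothetical samples, growing about 0.2 percent per day from 60 percent.
history = [60 + 0.2 * d for d in range(30)]
print(runway_days(history))   # roughly (76 - 65.8) / 0.2 = 51 days
```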
HyperFlex cluster capacity planning
Cisco Intersight enables you to predict the storage utilization for HyperFlex clusters. Using the historical data of
used storage capacity, along with the predicted storage utilization per HyperFlex cluster, you can proactively
scale the storage utilization on the HyperFlex clusters.
Following is a list of some important factors to consider:
● The Capacity Runway is calculated only for HyperFlex clusters claimed and connected to Intersight.
● Claim all the servers in the HyperFlex cluster with the proper license to enable the Capacity Runway.
● After a new HyperFlex cluster installation, it may take up to five minutes for the claimed cluster’s
licensing details to be refreshed. The HyperFlex Capacity Runway tab may temporarily show that the
license is not available.
● Capacity Runway is computed daily at 6:00 a.m. UTC. Historical data from the last six months is used for
the computation.
● A minimum of 14 days of data is required to begin predicting storage utilization.
● The accuracy of the predicted storage utilization increases with more recent historical data collected.
Alarms: Depending on the storage capacity utilization, alarms are raised up to six months in advance. A critical
alarm is raised when the storage utilization is predicted to exceed the recommended capacity limit of 76
percent in 60 days or if the capacity may exceed 100 percent in 90 days. A warning alarm is raised if the
storage utilization is predicted to exceed the recommended capacity limit of 76 percent in 180 days.
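The alarm thresholds described above can be written down as a small classification function. The sketch below simply encodes the rules stated in this section; the day values in the examples are hypothetical.

```python
# Encodes the alarm rules above: critical if 76 percent is predicted within 60 days
# or 100 percent within 90 days; warning if 76 percent is predicted within 180 days.
def runway_alarm(days_to_76, days_to_100):
    if (days_to_76 is not None and days_to_76 <= 60) or \
       (days_to_100 is not None and days_to_100 <= 90):
        return "critical"
    if days_to_76 is not None and days_to_76 <= 180:
        return "warning"
    return "ok"

print(runway_alarm(days_to_76=51, days_to_100=140))    # critical
print(runway_alarm(days_to_76=120, days_to_100=None))  # warning
print(runway_alarm(days_to_76=None, days_to_100=None)) # ok
```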
Notes:
● Capacity Runway is an Essentials feature.
● The Capacity Runway feature is in preview and is not meant for use in your production environment. Cisco
recommends that you use this feature on a test network or system.
● HyperFlex cluster capacity planning is not supported on the Cisco Intersight Virtual Appliance.
Utilization trend:
Printed in USA C11-744026-00 10/20